[Image filenames are relative to https://k3.botanicus.net/tmp/snap/ since the forums prevent me from posting images or links!]
Hi there,
I haven’t touched Linux on the desktop much since ~2007 and was very excited to try out snaps. Ubuntu’s integration is a really wonderful experience – it’s hard to spot the rough edges because as best I can tell they don’t exist. With one exception…
On a 32GB RAM, 8 thread laptop with a Samsung 970 Evo NVMe disk, LibreOffice startup performance is somehow perceptually on a par with spinning rust, and in the warm page cache case clearly worse than spinning rust. On investigating how this is possible, it became apparent that Squashfs is the smoking gun, and after further investigation I'm left puzzled by how a filesystem intended for resource-constrained embedded systems has become deeply 'embedded' in Snap's design, where it plays the role of constraining resources on much bigger machines!
During sleuthing I found Ubuntu bug #1636847 and was surprised to learn that, in order to cope with squashfs being pushed into a lesser-explored realm where it burns memory, Ubuntu kernels as of 18.10 (>2 years later!) are built with squashfs decompression single-threaded with respect to each filesystem. No wonder the computer feels like spinning rust while loading LibreOffice: the most expensive squashfs computation is serialized onto a single thread per mount rather than spread across cores, and I expect this means the NVMe drive's parallelism is also being completely wasted.
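For reference, the relevant build option can be checked directly; this assumes the stock Ubuntu config file is installed under /boot:
# which decompressor strategy the running kernel was built with; on the
# affected Ubuntu kernels this should report CONFIG_SQUASHFS_DECOMP_SINGLE=y
grep SQUASHFS_DECOMP /boot/config-$(uname -r)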
Test Setup
I made some profiles of LibreOffice startup in various configurations. Each invocation uses libreoffice macro:/// to run a standard library macro that opens a Writer document, waits 500ms, then exits the program. The cold boot case is measured after sync; echo 3 > /proc/sys/vm/drop_caches, while the warm boot case is the same invocation run immediately after a cold boot measurement. The scripts to recreate these measurements are included at the bottom. Real IO counts were measured separately with iostat /dev/nvme0n1.
Test system is Ubuntu 18.10, Dell XPS 9550
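In shorthand, each Apt cold/warm pair looks like the following (the Snap runs substitute snap run libreoffice; pstat.sh and the macro are listed at the bottom):
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches        # flush the page cache
./pstat.sh libreoffice macro:///Standard.Module1.Main   # cold boot measurement
./pstat.sh libreoffice macro:///Standard.Module1.Main   # warm boot, run immediately after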
Cold Boot (LibreOffice Apt)
Performance counter stats for 'system wide':
2483.663066 task-clock (msec) pstat-15446 # 0.768 CPUs utilized
5,841 context-switches pstat-15446 # 0.002 M/sec
47 cpu-migrations pstat-15446 # 0.019 K/sec
48,623 page-faults pstat-15446 # 0.020 M/sec
6,885,944,560 cycles pstat-15446 # 2.772 GHz (99.60%)
9,685,864,191 instructions pstat-15446 # 1.41 insn per cycle (99.60%)
2,044,567,580 branches pstat-15446 # 823.207 M/sec (99.60%)
32,418,805 branch-misses pstat-15446 # 1.59% of all branches (99.60%)
3.232673706 seconds time elapsed
2.23user 0.35system 0:03.34elapsed 77%CPU (0avgtext+0avgdata 239556maxresident)k
554248inputs+3024outputs (1847major+47931minor)pagefaults 0swaps
This is 3.2 seconds to fault in 272MiB of disk, open a document, wait half a second, then fully close. That is about the performance I'd expect from an NVMe machine.
Warm Boot (LibreOffice Apt)
Performance counter stats for 'system wide':
2307.322024 task-clock (msec) pstat-15527 # 0.954 CPUs utilized
1,599 context-switches pstat-15527 # 0.693 K/sec
48 cpu-migrations pstat-15527 # 0.021 K/sec
46,829 page-faults pstat-15527 # 0.020 M/sec
6,670,519,294 cycles pstat-15527 # 2.891 GHz (99.85%)
9,352,629,215 instructions pstat-15527 # 1.40 insn per cycle (99.85%)
1,982,135,931 branches pstat-15527 # 859.063 M/sec (99.85%)
31,260,595 branch-misses pstat-15527 # 1.58% of all branches (99.85%)
2.419333197 seconds time elapsed
2.14user 0.23system 0:02.49elapsed 95%CPU (0avgtext+0avgdata 239532maxresident)k
0inputs+2264outputs (0major+47942minor)pagefaults 0swaps
A slight improvement over the cold case: 2.4 seconds, no IO-triggering major faults, and iostat reports only 367KiB read from disk, as might be expected.
Cold Boot (LibreOffice Snap)
And now for the current Snap package…
Performance counter stats for 'system wide':
11770.091563 task-clock (msec) pstat-15075 # 0.996 CPUs utilized
19,659 context-switches pstat-15075 # 0.002 M/sec
291 cpu-migrations pstat-15075 # 0.025 K/sec
70,871 page-faults pstat-15075 # 0.006 M/sec
37,918,835,347 cycles pstat-15075 # 3.222 GHz (99.79%)
50,906,529,193 instructions pstat-15075 # 1.34 insn per cycle (99.79%)
7,504,717,974 branches pstat-15075 # 637.609 M/sec (99.79%)
472,008,620 branch-misses pstat-15075 # 6.29% of all branches (99.79%)
11.816502575 seconds time elapsed
4.68user 7.20system 0:11.89elapsed 99%CPU (0avgtext+0avgdata 289080maxresident)k
395876inputs+3000outputs (2455major+69633minor)pagefaults 0swaps
11.8 seconds to pull in 222MiB (despite the compression, barely any improvement over plain uncompressed ext4), and meanwhile the CPU cost has exploded – almost 6x as many cycles as on the plain filesystem. Thank goodness I’m plugged into AC!
Warm Boot (LibreOffice Snap)
But this is a cost only paid on cold cache, right?
Performance counter stats for 'system wide':
5017.110589 task-clock (msec) pstat-15259 # 1.027 CPUs utilized
2,727 context-switches pstat-15259 # 0.544 K/sec
103 cpu-migrations pstat-15259 # 0.021 K/sec
68,444 page-faults pstat-15259 # 0.014 M/sec
15,923,433,894 cycles pstat-15259 # 3.174 GHz (99.90%)
20,389,071,208 instructions pstat-15259 # 1.28 insn per cycle (99.90%)
3,764,918,161 branches pstat-15259 # 750.416 M/sec (99.90%)
43,101,749 branch-misses pstat-15259 # 1.14% of all branches (99.90%)
4.883716615 seconds time elapsed
4.76user 0.35system 0:04.95elapsed 103%CPU (0avgtext+0avgdata 295588maxresident)k
8inputs+2432outputs (0major+69637minor)pagefaults 0swaps
We still manage to burn 4.8 seconds pulling in just 738KiB from disk, and look at the cycles – well over twice those of a cold-cache run on the regular filesystem, and still almost half the work involved in a cold-cache squashfs run. Despite paying the battery cost of having gobs of RAM for cache in a laptop, the CPU will still burn brightly every time a Snap application is opened.
Squashfs Confirmation
I used perf record to try and confirm the extra runtime was spent in squashfs, but the results weren’t entirely conclusive: for example, the Snap traces show much higher futex() time, which may just be due to the extra delays introduced in the run.
Here is the Snap cold boot case recorded using prec.sh, clearly showing Squashfs responsible for almost all the runtime: snap-cold-boot-perf.png
Despite caching, the Snap warm boot still shows decompression accounting for >11% of samples: snap-warm-boot-perf.png
Finally, for completeness, a cold boot run of the Apt package; note how filemap_fault is active only 1.70% of the time compared to 38.8% in the Snap run, despite the Apt run completing almost 5x faster: apt-cold-boot-perf.png
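For anyone reproducing this, the recordings can be browsed afterwards with perf report; the exact sort keys below are just my preference rather than anything prec.sh mandates:
# interactive browser over the recorded samples
sudo perf report -i perf.data --sort symbol
# or a flat text summary of the hottest symbols
sudo perf report -i perf.data --stdio --sort symbol | head -30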
Background
According to the snap format documentation, which is the closest I could find to a design document:
This method delivers fast and extremely predictable installations, with no installation remnants, and no way for a snap’s content to be mutated or interfered with. A snap is either installed and available as originally built, or it is not available at all.
It seems the idea is that since the wire transfer format matches that of the storage format on the end system, no extraction step is necessary, and users are less likely to tinker with the opaque squashfs blob downloaded by the system.
This sounds great, except of course, squashfs provides no immutability guarantees, nor was it ever designed to - users can easily mksquashfs modded images, or even more easily bind or union mount their edits on top of snapd’s repository.
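As a concrete illustration of how little the format actually prevents (the path below is hypothetical, and this is only a sketch):
# take a file from the read-only squashfs mount, edit a copy, and bind-mount
# the copy back over the original; the "immutable" image itself is never touched
cp /snap/libreoffice/current/some/config.file /tmp/config.file   # hypothetical path inside the mounted snap
$EDITOR /tmp/config.file
sudo mount --bind /tmp/config.file /snap/libreoffice/current/some/config.file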
Finally, it seems very apparent that Snap is optimizing for entirely the wrong thing – software installations happen once, whereas reboots and program restarts happen constantly. Even with page cache assistance, squashfs performance is still roughly twice as slow as straight ext4. It is not even the case that Snap is amortizing the cost of some expensive operation over the first program run; it repays the same cost over and over again, making all packages and invocations less responsive and burning battery in perpetuity.
Conclusions / Suggestions
Measured on a macro scale, this design worsens usability for everyone – not just while snapd itself is doing work, but at all times when any Snap-packaged software is active.
Squashfs might be a reasonable choice of transfer format, but it was never intended as a storage format in this use case. It might be possible to improve squashfs performance over time - at the expense of more cache or computation (and therefore energy) - but it is not possible to work around a more fundamental misstep:
Compression has existed in desktop operating systems for more than three decades, but since the mid-90s, as storage densities and transfer rates exploded, it has never been the case that compression was a mandatory feature – and for good reason: it almost always hurts performance, as demonstrated here. There are few examples of compressed containers helping runtime on a macro scale, and where they do, it is for bizarre reasons (one example is Python’s ZIP importer, which speeds up some scenarios by avoiding system calls!).
Compression is better left as a choice to the end user, where they can decide if the processing/energy/battery/latency tradeoff is worth it in their particular scenario. Linux has good support for this already at least in the form of btrfs.
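For instance, btrfs already lets the user opt in per mount or per directory (the device and paths here are placeholders):
# opt in for a whole filesystem at mount time
sudo mount -o compress=zstd /dev/nvme0n1p5 /mnt/data
# or opt in for just one directory tree; newly written files get compressed
sudo btrfs property set /mnt/data/archive compression zstd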
Logical immutability is a noble design goal, but it can only ever be considered a logical abstraction. A squashfs blob is a low-hanging-fruit form of obfuscation at best – it is specifically no better (and IMHO, for this application, much worse) than ext4 chattr +i immutable attributes, or mount --bind -o ro /snap /snap, schemes which are much easier to implement and perfectly match the design intention of those VFS/filesystem features.
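A sketch of what those alternatives look like, assuming packages were stored unpacked (which they are not today):
# mark an unpacked package tree immutable at the filesystem level
sudo chattr -R +i /snap/libreoffice/100
# and/or make the whole repository a read-only bind mount of itself
sudo mount --bind -o ro /snap /snap
# (older util-linux may need an explicit: sudo mount -o remount,ro,bind /snap)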
Strong integrity checking might well become a real requirement in the future, but the kernel already has dedicated subsystems for that, developed by some very smart people – for example dm-verity/fs-verity as employed on Android.
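For completeness, the userspace side of dm-verity is a couple of commands; the device and paths below are made up for illustration:
# build a hash tree over the image and note the printed root hash
sudo veritysetup format /dev/loop0 /var/lib/snapd/verity/libreoffice.hash
# expose a verified, tamper-evident, read-only view using that root hash
sudo veritysetup open /dev/loop0 libreoffice-verified /var/lib/snapd/verity/libreoffice.hash <root-hash>
sudo mount -o ro /dev/mapper/libreoffice-verified /snap/libreoffice/100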
One suggestion would be to trade the complexity of managing many filesystem mounts for a streaming installer – one that extracts e.g. the squashfs image as it arrives off the network, hiding the extraction latency behind the download just as today’s design hides decompression by amortizing it over every program invocation. Even a mid-90s disk could keep up with streaming extraction of a filesystem at today’s typical WAN link speeds. A similar trick can be played during uninstall – make a journal entry, and asynchronously delete while the user continues working.
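Even without true streaming, the pay-the-cost-once model can be approximated by hand today, which gives a feel for what such an installer would do. This is only a sketch using stock snap/squashfs-tools commands; the revision number and target path just mirror the snap list output below, and it is of course not how snapd installs anything:
# fetch the blob exactly as snapd would, then pay the decompression cost once
snap download libreoffice
sudo unsquashfs -d /snap/libreoffice/100 libreoffice_100.snap
# from here a plain read-only bind mount could replace the squashfs loop mount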
A halfway solution is possible too – by making the snapd backend configurable, so that users who care about performance can disable the currently mandatory and extraordinarily wasteful compression.
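Squashfs itself already supports that knob; repacking an existing snap without compression is trivial (filenames are placeholders, and the -no* flags disable compression of the inode table, data blocks, fragments and xattrs respectively):
# unpack the published image, then rebuild it with compression switched off
unsquashfs -d /tmp/libreoffice-tree libreoffice_100.snap
mksquashfs /tmp/libreoffice-tree libreoffice_100_nocomp.snap -noI -noD -noF -noX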
Thanks!
pstat.sh
Isolate the script process in a new perf_event cgroup, then run perf stat selecting processes in that cgroup.
#!/bin/bash
set -ex
# Create a throwaway perf_event cgroup named after this shell's PID and move
# this shell into it, so perf can select only this script's descendants.
cgname=pstat-$$
sudo mkdir /sys/fs/cgroup/perf_event/$cgname
echo $$ | sudo tee /sys/fs/cgroup/perf_event/$cgname/cgroup.procs
# Move any previous perf.data out of the way.
mv perf.data perf.data.prev || true
# iostat before and after bounds the real IO issued to the device.
iostat /dev/ssd
/usr/bin/time perf stat -ae task-clock,context-switches,cpu-migrations,page-faults,major-faults,minor-faults,cycles,instructions,branches,branch-misses -G $cgname "$@"
iostat /dev/ssd
prec.sh
As above, but record kernel stacks for use with perf report
#!/bin/bash
set -ex
# Same cgroup setup as pstat.sh, but sample cycles (kernel-side, per --all-kernel)
# instead of just counting events.
cgname=pstat-$$
sudo mkdir /sys/fs/cgroup/perf_event/$cgname
echo $$ | sudo tee /sys/fs/cgroup/perf_event/$cgname/cgroup.procs
# Move any previous perf.data out of the way.
mv perf.data perf.data.prev || true
/usr/bin/time perf record --all-kernel -ae cycles -G $cgname "$@"
LibreOffice macro
Saved to the standard macro library of both LibreOffice installations
Sub Main
url = ConvertToURL("/home/dmw/snap/libreoffice/current/test.docx")
oDocument = StarDesktop.LoadComponentFromURL( url, "_blank", 0, Array())
Wait 500
StarDesktop.Terminate
End Sub
LibreOffice Snap command line
./pstat.sh snap run libreoffice macro:///Standard.Module1.Main
./prec.sh snap run libreoffice macro:///Standard.Module1.Main
Version:
Name Version Rev Tracking Publisher Notes
libreoffice 6.1.4.2 100 stable canonical✓ -
LibreOffice Apt command line
./pstat.sh libreoffice macro:///Standard.Module1.Main
./prec.sh libreoffice macro:///Standard.Module1.Main
Version:
ii libreoffice 1:6.1.3-0ubuntu0.18.10.2 amd64 office productivity suite (metapackage)