Layouts: re-mapping snap directories

zyga-snapd · July 28, 2017, 11:41am

Hey

I’d like to describe an upcoming snapd feature that was designed at a recent sprint. The keyword is layouts, and it will let snap developers alter the mount namespace in which their applications execute in ways that were not possible before.

The general idea is that a snap can statically declare that it wants to take some content and put it somewhere in the filesystem. The content can come from the snap itself or from the snap’s system-wide data. This will, for example, allow us to create the impression that a given directory is mounted in /var/lib or that a particular file exists in /etc/. This should allow snap developers to work around the hard-coded locations common to many libraries and packages.

Unlike interfaces, such as the content interface, layouts cannot be “disconnected”. Anyone that designs a snap can do so taking advantage of the fact that before their code runs the layout will be already in place.
Layouts can also evolve over time, so subsequent revisions of a snap can change the layout as the need arises.

In subsequent posts I will describe the feature both semantically, from developer point of view, as well as technically to show how it will be implemented. Feel free to join the conversation and ask questions.

zyga-snapd · July 28, 2017, 12:26pm

As a snap author you will have some new syntax you can put at the top-level of your snapcraft.yaml or snap.yaml. Let me start with a simple example that we shall discuss below:

snap.yaml and snapcraft.yaml syntax

name: my-fun-snap
version: 1.0
apps:
  my-fun-snap:
    command: ...
...
layout:
  /usr:
    bind: $SNAP/usr
  /mytmp:
    type: tmpfs
    user: nobody
    group: nobody
    mode: 1777
  /mylink:
    symlink: /link/target

As you can see layout is a top-level construct that describes filesystem entries. Each entry can be of one of the three inferred types: a bind mount, a tmpfs mount or a symlink.
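
To make the "inferred type" idea concrete, here is a minimal Python sketch that classifies each layout entry by the keys it carries. The key names (bind, type: tmpfs, symlink) come from the example above; the function name infer_layout_kind and the validation rule are assumptions made for illustration and are not snapd code.

```python
# Illustrative only: infer the kind of a layout entry from its keys.
# infer_layout_kind is a hypothetical helper, not part of snapd.

def infer_layout_kind(entry: dict) -> str:
    is_bind = "bind" in entry
    is_tmpfs = entry.get("type") == "tmpfs"
    is_symlink = "symlink" in entry
    # Assumption: an entry must describe exactly one kind of object.
    if sum([is_bind, is_tmpfs, is_symlink]) != 1:
        raise ValueError(f"entry must be exactly one of bind, tmpfs or symlink: {entry!r}")
    if is_bind:
        return "bind"
    if is_tmpfs:
        return "tmpfs"
    return "symlink"


# The layout section from the example above, as a plain dictionary.
layout = {
    "/usr": {"bind": "$SNAP/usr"},
    "/mytmp": {"type": "tmpfs", "user": "nobody", "group": "nobody", "mode": "1777"},
    "/mylink": {"symlink": "/link/target"},
}

kinds = {path: infer_layout_kind(entry) for path, entry in layout.items()}
print(kinds)
```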

Bind mount

Bind mounts are perhaps the most familiar type and will be the first to arrive. The declaration above will hide anything that is usually present in /usr and replace it with what the snap ships in its $SNAP/usr/ directory. The variable $SNAP is expanded internally to a path such as /snap/my-fun-snap/42/.
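
The expansion step can be sketched in a few lines of Python. This is only an illustration: the helper name expand_snap_vars is hypothetical, and the /snap/<name>/<revision> scheme is extrapolated from the single example path above (written without the trailing slash so the joined path has no double slash). Only $SNAP is handled here; any other variable raises KeyError.

```python
import string


def expand_snap_vars(path: str, snap_name: str, revision: str) -> str:
    # Assumed directory scheme, based on the /snap/my-fun-snap/42/ example.
    variables = {"SNAP": f"/snap/{snap_name}/{revision}"}
    # string.Template replaces $NAME placeholders; unknown names raise KeyError.
    return string.Template(path).substitute(variables)


expanded = expand_snap_vars("$SNAP/usr", "my-fun-snap", "42")
print(expanded)  # /snap/my-fun-snap/42/usr
```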

If you have never played with bind mounts you can experiment on your own system using the mount --bind /source /target command (you can unmount /target later). Keep in mind that bind mounts are not copies or symlinks. Anything that is visible from /source becomes visible from /target. Notably, both /source and /target must exist (even if they are just empty directories) because mount needs to attach to an existing file or directory.

If there are other things mounted in /source, for example if /source/data is a mounted USB stick, the way the mount --bind command is issued affects what will show up in /target/data. If the bind mount is recursive then everything that was mounted under /source at the time the command was issued will also show up under /target. On the command line this can be done by using mount --rbind, which stands for recursive bind.

After the bind is done, subsequent mount operations can be performed inside both /source and /target, and their behavior depends on the type of sharing that is set up. In simple terms, when we mount or unmount /source/data this event can propagate to peer groups (such as /target). The event can propagate both ways (shared), from master to slaves only (master and slave), or not at all (private). This allows complex arrangements where events only propagate in certain directions, or even to things that are not visible from the point of view of the process inhabiting a given mount namespace. You don’t necessarily have to use all of those features but it is worth remembering they exist.
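
One way to see which sharing mode a mount has is to read the optional fields of its line in /proc/self/mountinfo, where the kernel reports tags such as shared:N and master:N. The Python sketch below classifies a line accordingly. The function name propagation_of and the sample lines are made up for illustration; the field layout follows the proc(5) description of mountinfo.

```python
def propagation_of(mountinfo_line: str) -> str:
    """Classify the propagation type of one /proc/<pid>/mountinfo line."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed; optional fields follow until a lone "-" separator.
    separator = fields.index("-")
    tags = {field.split(":")[0] for field in fields[6:separator]}
    if "shared" in tags and "master" in tags:
        return "shared+slave"
    if "shared" in tags:
        return "shared"
    if "master" in tags:
        return "slave"
    if "unbindable" in tags:
        return "unbindable"
    return "private"


# Hypothetical sample lines, not taken from a real system.
shared_line = "139 25 8:2 /src /dst rw,relatime shared:1 - ext4 /dev/sda2 rw"
slave_line = "140 25 8:2 /src /dst rw,relatime master:1 - ext4 /dev/sda2 rw"
private_line = "141 25 8:2 /src /dst rw,relatime - ext4 /dev/sda2 rw"

for line in (shared_line, slave_line, private_line):
    print(propagation_of(line))
```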

For the curious readers this is described in detail in the Linux kernel documentation https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt (the keyword is “shared subtree”).

The bind layout element will allow only a subset of this complexity, namely recursive, shared mounts. This is the most intuitive behavior that fits most use cases.

Interestingly, the source directory may refer to places other than $SNAP. In fact, most arbitrary paths can be used. There will be some limitations (described below), such as /media, but in general, unless you are trying to be malicious, the system should not get in your way.

The target directory can already exist but will be created automatically if necessary. The user, group and mode attributes can be used to define fine-grained details about the created parent directories. The technical approach to creating new entries on top of a read-only substrate, such as a squashfs filesystem, will be described in the implementation notes in a subsequent post. From the developer’s and the user’s point of view it will just work.

Tmpfs

The tmpfs element is also very familiar. It will allow snap developers to put a new, ephemeral, temporary file system at the desired location. The same semantics apply regarding variable expansion (e.g. $SNAP, $SNAP_COMMON or $SNAP_DATA), creation of parent directories, and application of user, group and mode (permissions).

Use of tmpfs can be beneficial to explicitly create empty spaces, e.g. for small amounts of ephemeral data, lock files, pid files or other typical things. Keep in mind that tmpfs is backed by RAM and can be very constrained. At this time there is no syntax to describe the size limits of a mounted tmpfs, but such a keyword may be added later.

Symlink

The symlink element can be used to create a symlink rooted at the designated location and pointing at the specified target. The target doesn’t have to exist (the symlink can be broken). The same semantics regarding creation of parent directories, user and group ownership, and permissions mentioned earlier apply.

Symlinks or bind mounts? What is appropriate for me?

Symlinks and bind mounts have a similar “feeling” but they behave differently. Symlinks are very common and many applications can detect them. In comparison, bind mounts are not as easy to detect, so they have a higher chance of being transparent to the application. The downside is that a mount point (/target) cannot be removed before it is unmounted, while many applications can simply unlink a symbolic link.
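
The symlink properties mentioned above (a link may be broken, an application can detect it, and it can be unlinked without any unmounting) can be tried without root. This is a small Python sketch that runs in a temporary directory; the file names are arbitrary and the helper make_dangling_symlink is hypothetical.

```python
import os
import tempfile


def make_dangling_symlink(directory: str) -> str:
    """Create a symlink whose target does not exist and return its path."""
    link = os.path.join(directory, "mylink")
    os.symlink(os.path.join(directory, "no-such-target"), link)
    return link


with tempfile.TemporaryDirectory() as tmp:
    link = make_dangling_symlink(tmp)
    is_link = os.path.islink(link)         # the link itself is detectable
    target_exists = os.path.exists(link)   # exists() follows the link to its target
    os.unlink(link)                        # removing a symlink needs no unmount
    still_there = os.path.lexists(link)

print(is_link, target_exists, still_there)  # True False False
```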

Bind mount gotchas

Lastly, bind mounts have some, say, unusual semantics when the source is deleted. Let’s look at a simple example:

root@fyke:~# mkdir experiment
root@fyke:~# cd experiment/
root@fyke:~/experiment# mkdir source
root@fyke:~/experiment# mkdir target
root@fyke:~/experiment# mount --bind source target
root@fyke:~/experiment# ls -l
total 8
drwxr-xr-x 2 root root 4096 lip 28 15:44 source
drwxr-xr-x 2 root root 4096 lip 28 15:44 target
root@fyke:~/experiment# cat /proc/self/mountinfo | egrep 'source|target' 
139 25 8:2 /home/zyga/experiment/source /home/zyga/experiment/target rw,relatime shared:1 - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
root@fyke:~/experiment# rmdir target
rmdir: failed to remove 'target': Device or resource busy
root@fyke:~/experiment# rmdir source
root@fyke:~/experiment# cat /proc/self/mountinfo | egrep 'source|target' 
139 25 8:2 /home/zyga/experiment/source//deleted /home/zyga/experiment/target rw,relatime shared:1 - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
root@fyke:~/experiment# touch target/file
touch: cannot touch 'target/file': No such file or directory
root@fyke:~/experiment# umount target
root@fyke:~/experiment# ls -l
total 4
drwxr-xr-x 2 root root 4096 lip 28 15:44 target
root@fyke:~/experiment# rmdir target

While the target is mounted it cannot be removed. This restriction does not apply to the source, but once the source is removed the target is, well, broken. It is still a mount point, but the directory behind it no longer exists, so we cannot create anything inside it anymore (anything that used to be there would have had to be removed before the source could be removed anyway, so it is always empty).
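
A broken bind mount of this kind can be spotted from its mountinfo line: in the transcript above the source path gains a //deleted suffix once the source directory is removed. The Python sketch below checks for that suffix. The function name bind_source_deleted is hypothetical, and the check relies only on the marker visible in the transcript rather than on any snapd logic.

```python
def bind_source_deleted(mountinfo_line: str) -> bool:
    """Return True if the mount's root (source path) is marked as deleted."""
    fields = mountinfo_line.split()
    root = fields[3]  # fourth field: root of the mount within its filesystem
    return root.endswith("//deleted")


# The two mountinfo lines from the transcript above.
before = ("139 25 8:2 /home/zyga/experiment/source /home/zyga/experiment/target "
          "rw,relatime shared:1 - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered")
after = ("139 25 8:2 /home/zyga/experiment/source//deleted /home/zyga/experiment/target "
         "rw,relatime shared:1 - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered")

print(bind_source_deleted(before), bind_source_deleted(after))  # False True
```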

wmmihaa · October 11, 2017, 7:31am

When could we expect this feature?

zyga-snapd · October 11, 2017, 7:32am

2.29 is going to be forked today, so at the earliest you will see parts of this in 2.30. I’ll update this post when something useful lands in edge.

wmmihaa · October 11, 2017, 7:33am

Thanks! Would that mean weeks, or months away?

zyga-snapd · October 11, 2017, 7:33am

Snapd is on a monthly release. I’ll do my best to have everything needed in 2.30

zyga-snapd · October 12, 2017, 6:31am

I’m on a good track to actually land something in 2.29, assuming @mvo agrees and deems it safe.

zyga-snapd · October 19, 2017, 2:36pm

I’d like to give everyone an update on where we are with this feature:

I’m working on three base features:

simple creation of target / source directories for mounts / bind mounts (PR 4008)
control of mode and ownership of created directories (PR 3965)
generalized creation of directories on top of read-only filesystems using overlayfs (Git branch)

The most interesting one is the last one. I will open the PR soon but I’m still working on the undo logic and I want the prerequisite 4008 to land first. Once we have it open we need to examine how viable it is and whether it poses any security risks. I must say I’m really fond of this approach as it has the potential to simplify everything tremendously and is very elegant on the inside and at runtime.

After this I’d like to explore the content interface and update it to the new specification. Using this set of features we should be able to easily create aggregation directories. In parallel we can start exploring the layout work itself, as all that would be missing is a better definition of what is and is not allowed, and connecting the layout specification with the existing mount backend.

ogra · October 19, 2017, 2:54pm

given that not all kernels have overlayfs (specifically custom customer kernels) and we do not enforce it in our kernel plugin config checks either, will you have a fall-back implementation an application snap using layouts falls back to in this case ?

also, what happened to “overlayfs completely breaks LSM (apparmor/selinux)” which initially made us use bind mounts everywhere, was that fixed ? (@jdstrand ? )

zyga-snapd · October 19, 2017, 3:00pm

I don’t have a fallback yet (I would argue I have only just gotten the implementation off the ground). A fallback is possible, just uglier and definitely more complex, as it involves very elaborate undo logic.

As for LSM I think it’s been fixed for SELinux recently, I’ll let @jdstrand comment about Apparmor. I’ll just note that we always use lowerdir=/something/from/squashfs,upperdir=/something/from/tmp. We don’t overlay two writable filesystems in any way. The only thing in the upper dir will be the extra directory node which will be used to create a bind mount to something from regular filesystems (squash or normal writable).

jdstrand · October 19, 2017, 5:44pm

No, overlayfs does not work correctly with AppArmor in all cases. It may work well enough for the described functionality with some later versions of overlayfs but this would have to be extensively tested. I’m very concerned about the approach because while overlayfs was (aiui) introduced in 3.18, it was not designed to be used with LSMs and it underwent a lot of change over the years to get to the point where we are today (which still doesn’t work correctly with all the LSMs). Importantly, the changes required to get it to work with AppArmor are to overlayfs itself, not (just) the AppArmor subsystem. Also, it is unknown at this point how overlayfs will work with stacked LSMs.

I fear the feature will be brittle and a bug factory because the current approach for enablement is pick a kernel, drop the latest apparmor directory in place, run the kernel config checks and tweak here and there. Since overlayfs in 3.18 is different than in 4.14 precisely to address its LSM deficiencies (in part), and because people might try to backport apparmor in the manner described to every 3.18+ kernel, the matrix to verify things work correctly everywhere is large and highly labor-intensive.

The other concern is @ogra’s point that overlayfs is not readily backportable, so snaps that utilize the feature will be unusable on the popular 3.4, 3.10 and 3.14 kernels. It is presumed that overlayfs patches will need to be backported to 3.18+ kernels, but I don’t know how backportable those patches are.

If a fallback solution is going to be required for any kernels, why not simply focus on the fallback that is known to work everywhere?

I’m somewhat surprised by the change of direction. This topic has been discussed several times and the decision was always to not use overlayfs, but all of a sudden that decision seems to have been reversed when nothing in the landscape (backportability, LSMs) has changed appreciably.

Conan_Kudo · October 19, 2017, 6:00pm

If you’re talking about old broken Android kernels, then I would just ignore them. If you’re talking about distro kernels, the only 3.10 kernel I know of in active use is the RHEL kernel, and that one has overlayfs from 4.10 backported into it, as of RHEL 7.4.

ogra · October 19, 2017, 8:22pm

I was mainly talking about BSP kernels from IoT board manufacturers and about already existing kernels in vendor stores that used the kernel plugin to create their snap etc…
I actually didn’t even think about all the other possibilities and corner cases that Jamie pointed out here. And indeed there are many other distro kernels we’d have to think about too…

zyga-snapd · October 20, 2017, 1:15pm

Another small update. While I’m waiting for review for pre-requisites I reached a point where the transparent overlays work correctly with undo and update logic. I’ll focus my day on spread tests that attempt to break this as well as expanding coverage to other distributions.

At this stage I should be able to improve the content interface to do aggregation next week.

bashfulrobot · October 23, 2017, 2:56pm

This looks super interesting to me. I had another post about using linking to do something similar.

morphis · October 24, 2017, 9:05am

@zyga-snapd I am wondering what the way forward with overlayfs is, given the valid points @jdstrand raised above. We already have devices in production which as of today satisfy all requirements for snapd, but by using overlayfs by default we’re putting up a new requirement which will cost time to implement, test etc. for production devices and might not happen. So layouts will stay an optional feature for these devices and we will have to refuse installation of snaps using it.

We can’t ignore any of these Android kernels as they are being used on various boards as a base. We don’t talk about std. distribution kernels here but about those coming with a board support package from a silicon vendor.

zyga-snapd · October 24, 2017, 9:21am

This is a valid concern, we need to both support existing devices that may not have a particular feature available as well as support the evolution of the snap platform.

Can you say with certainty that there is a device out there with a kernel that does not support overlayfs? Note that there certainly may be one but I’d like to know if you know about one in particular.

ogra · October 24, 2017, 9:33am

Did i read @jdstrand wrong above or wouldn’t we need the actual recent version backported to whatever kernel version to actually have it work even if it is enabled in the config (which the snapcraft kernel plugin does currently not do nor do the config test scripts require it ) ?

As i understood it only the very latest versions actually work … so you would have to additionally backport it, regardless of having it “supported” …

zyga-snapd · October 24, 2017, 9:35am

Note that our use case of overlayfs is very specific so we may actually get away with it. We only use it to project tmpfs over a squashfs and only to create directories and only so that those directories can be bind mounted over.

morphis · October 24, 2017, 9:37am

That is what I understood too. Only the latest version works properly with AppArmor. So unless we have a device without confinement overlayfs needs to be backported.

We have devices running kernels down to 3.4 today. There was no work to get a newer overlayfs version integrated into any of these kernels so whatever was in the particular upstream release for overlayfs is available.