Snapd 2.36, snap-confine logic walkthrough

NOTE TO EDITORS
This is still a work in progress and I will be replacing this post with revised versions (to remove the TBD sections)

#work #documentation


This post is highly technical. The intended audience is fellow snapd developers and security auditors.

snap-confine mount namespace processing

The following section describes how snap-confine prepares mount namespaces for application executables and hooks to operate in.

Perform global initialisation

We grab a flock-based lock at /run/snapd/lock/.lock, which acts as a global lock, and perform global initialisation:

  1. We make a mount point at the directory /run/snapd/ns by bind-mounting it over itself. We subsequently switch the mount point to private mount event propagation. This means that anything mounted there is not replicated to other mount namespaces. This is a kernel requirement for preserving the mount namespace of a running process by bind-mounting /proc/$pid/ns/mnt somewhere, which is how we preserve initialised mount namespaces across the process life cycle.
  2. We look at the mount event propagation setting for /. If it is not set to shared (the default when systemd is pid-1), we bind-mount the snap mount directory, either /snap or /var/lib/snapd/snap, over itself to create a new mount entry and switch that entry to shared mount event propagation. This is done so that mount namespaces created by unsharing from the initial mount namespace receive mount events for mounting and unmounting additional revisions of snaps. This applies to Ubuntu 14.04, where systemd is not pid-1, and to system containerisation solutions like LXD, where the default sharing is private. A minimal sketch of both steps follows this list.
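In rough C, the two steps could look like this. This is only a sketch: the function name is hypothetical and error handling is elided.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

static void sc_initialize_ns_dir(void) {
    /* The global flock-based lock. */
    int lock_fd = open("/run/snapd/lock/.lock",
                       O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    flock(lock_fd, LOCK_EX);

    /* Step 1: bind-mount /run/snapd/ns over itself so it becomes a
     * mount point, then make it private so preserved namespaces are
     * not replicated into other mount namespaces. */
    mkdir("/run/snapd/ns", 0755);
    mount("/run/snapd/ns", "/run/snapd/ns", NULL, MS_BIND, NULL);
    mount("none", "/run/snapd/ns", NULL, MS_PRIVATE, NULL);

    /* Step 2 (only when / is not shared): make the snap mount
     * directory a self-bound mount point with shared propagation. */
    mount("/snap", "/snap", NULL, MS_BIND, NULL);
    mount("none", "/snap", NULL, MS_SHARED, NULL);

    flock(lock_fd, LOCK_UN);
    close(lock_fd);
}
```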

Re-associate with pid-1 mount namespace

We compare our own mount namespace, as seen in /proc/self/ns/mnt, with that of the init process (/proc/1/ns/mnt). Starting with Linux 3.8 the mount namespace appears as a symbolic link inhabiting the nsfs filesystem. The target of the symbolic link is the string mnt:[123...789], with the integer varying from one namespace to another. If we are not in the initial mount namespace we perform a setns call to move ourselves there.

We do this so that we can get back to the initial mount namespace if we are invoked from some foreign namespace, such as when invoked from within a snap.
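A condensed sketch of the check and re-association, comparing namespace identity via stat(2) rather than by reading the symbolic link target (equivalent for this purpose); the function name is hypothetical:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/stat.h>
#include <unistd.h>

static void sc_reassociate_with_pid1_mnt_ns(void) {
    struct stat self_st, pid1_st;
    stat("/proc/self/ns/mnt", &self_st);
    stat("/proc/1/ns/mnt", &pid1_st);
    if (self_st.st_dev == pid1_st.st_dev &&
        self_st.st_ino == pid1_st.st_ino) {
        return; /* already in the initial mount namespace */
    }
    int fd = open("/proc/1/ns/mnt", O_RDONLY | O_CLOEXEC);
    setns(fd, CLONE_NEWNS); /* move into the initial namespace */
    close(fd);
}
```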

Grab the per-snap lock

We grab another flock-based lock at /run/snapd/lock/$SNAP_INSTANCE_NAME.lock. This guarantees that we do not race with other instances of ourselves or with other compatible utilities such as snap-update-ns. We will only release this lock a moment before executing the snap-exec program, which is the next element of the snap run, snap-confine, snap-exec chain.

Create or join a mount namespace

We will now attempt to join or create a new mount namespace, one designed specifically for the instance of the snap we are attempting to start. There are some complexities we need to consider though. The mount namespace may exist and be just right; in that case we simply reuse it. Alternatively it may exist but the base snap may have changed revision since. This is something that we call a stale mount namespace. When that happens we can discard it and start over, but only if the mount namespace is not inhabited by any processes. We don't want to create a split perspective where some processes of a given snap see one thing and others see something else, so we only discard a stale mount namespace if it is also unused.

Inspect the preserved mount namespace

We start the inspection by opening /run/snapd/ns and keeping the directory file descriptor around for the duration of various other operations. We then proceed to open /run/snapd/ns/$SNAP_INSTANCE_NAME.mnt and both fstat and fstatfs it. The first system call gives us information about the file itself, the second about the filesystem it inhabits.

We inspect the filesystem information, specifically the type of the filesystem used. If the file is a preserved mount namespace the filesystem type should be NSFS_MAGIC or PROC_SUPER_MAGIC (on older kernels). Anything else indicates that this is likely a regular file placed on the tmpfs that is mounted at /run. Unless the file looks like a preserved mount namespace we don't try to inspect it further and just proceed to create a fresh mount namespace ourselves.
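A sketch of what this check can look like, assuming the magic constants are available from <linux/magic.h> (newer kernel headers) and a hypothetical helper name:

```c
#include <linux/magic.h>
#include <stdbool.h>
#include <sys/vfs.h>

static bool sc_looks_like_preserved_mnt_ns(int fd) {
    struct statfs fs_info;
    if (fstatfs(fd, &fs_info) < 0) {
        return false;
    }
    /* nsfs on modern kernels; older kernels accounted namespace
     * files to procfs. Anything else is likely a plain file on the
     * tmpfs mounted at /run. */
    return fs_info.f_type == NSFS_MAGIC ||
           fs_info.f_type == PROC_SUPER_MAGIC;
}
```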

The inspection process is somewhat involved. To know that a mount namespace is stale we want to compare the major:minor number of the root filesystem inside the namespace with the major:minor number of the current revision of the base snap we are operating against. We start by resolving the symbolic link /snap/${base snap name}/current (note that /snap may in fact be /var/lib/snapd/snap, but for brevity I will not spell out the alternate form each time). Knowing the right revision we proceed to scan /proc/self/mountinfo looking for a mount entry that matches /snap/${base snap name}/${base snap revision}. The mount info table also contains the major:minor numbers, so we record those.
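A simplified version of that scan follows. Note that this is a sketch with a hypothetical helper name, and it ignores the octal escapes (e.g. \040 for spaces) that real mountinfo parsing must handle:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Find the major:minor string of the filesystem mounted at
 * mount_point by scanning /proc/self/mountinfo. */
static bool find_mount_devno(const char *mount_point,
                             char *devno, size_t devno_size) {
    FILE *f = fopen("/proc/self/mountinfo", "r");
    if (f == NULL) {
        return false;
    }
    char line[4096];
    bool found = false;
    while (fgets(line, sizeof line, f) != NULL) {
        char dev[32], mnt[PATH_MAX];
        /* Fields: mount ID, parent ID, major:minor, root, mount point. */
        if (sscanf(line, "%*d %*d %31s %*s %4095s", dev, mnt) == 2 &&
            strcmp(mnt, mount_point) == 0) {
            snprintf(devno, devno_size, "%s", dev);
            found = true;
            break;
        }
    }
    fclose(f);
    return found;
}
```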

We then create an eventfd file descriptor, used for extremely simple inter-process communication, and fork a child process that will look at the inside of the mount namespace. The child re-associates its mount namespace with the one described by the preserved mount namespace file descriptor we hold open and analyses the mount info table again. Using similar logic it records the major:minor numbers of the filesystem mounted at /. Armed with the prior knowledge it compares the pair of major:minor numbers, sends the result of the comparison via the eventfd file descriptor and exits.

The parent receives the comparison and waits for the child to terminate. It is worth noting that whenever additional processes are involved we follow a special protocol to ensure that we don't wait indefinitely on IPC or react incorrectly to the unexpected death of either the parent or the child. This involves a timeout of three seconds and prctl-based self-termination in case the parent process dies, followed by a kill-based check that the parent is operational before attempting to perform any dedicated logic.
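Sketched out, the protocol looks roughly like this; is_mnt_ns_stale() is a hypothetical stand-in for the setns-and-scan logic, and the timeout handling is abbreviated to a comment:

```c
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

bool is_mnt_ns_stale(void); /* hypothetical: setns + mountinfo scan */

static bool sc_query_ns_staleness(void) {
    int efd = eventfd(0, EFD_CLOEXEC);
    pid_t parent = getpid();
    pid_t child = fork();
    if (child == 0) {
        /* Ask the kernel to kill us if the parent dies, then close
         * the race window: if the parent died before the prctl took
         * effect, kill(parent, 0) fails and we bail out. */
        prctl(PR_SET_PDEATHSIG, SIGKILL);
        if (kill(parent, 0) < 0) {
            _exit(1);
        }
        uint64_t verdict = is_mnt_ns_stale() ? 1 : 2; /* illustrative encoding */
        write(efd, &verdict, sizeof verdict);
        _exit(0);
    }
    uint64_t verdict = 0;
    /* The real code guards this read with a ~3 second timeout. */
    read(efd, &verdict, sizeof verdict);
    waitpid(child, NULL, 0);
    return verdict == 1;
}
```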

In the parent process, armed with the status of staleness, we can either directly reuse the namespace or consider discarding it. To know if we can safely discard the mount namespace we check if the freezer control group associated with the snap instance is occupied. We do this by inspecting the file cgroup.procs in the /sys/fs/cgroup/freezer hierarchy dedicated to the snap instance. If the file is non-empty, as determined by reading from it, we know that some processes are still there and we must keep using the stale mount namespace for now.
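A sketch of the occupancy check; the exact cgroup path layout shown here is an assumption:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

static bool sc_cgroup_freezer_occupied(const char *snap_instance_name) {
    char path[PATH_MAX];
    snprintf(path, sizeof path,
             "/sys/fs/cgroup/freezer/snap.%s/cgroup.procs",
             snap_instance_name);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return false; /* no cgroup, nothing can be running */
    }
    int c = fgetc(f); /* a non-empty file means live processes */
    fclose(f);
    return c != EOF;
}
```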

Note that this kind of check works regardless of the nature of those processes. They can be long-running services, short-lived applications or even snap-specific hooks.

If the mount namespace is both stale and empty we detach the preserved mount namespace file and proceed to set up a fresh mount namespace.

Create a fresh mount namespace

Once we take the decision to create a new mount namespace we fork off a support process, using the same safeguards as described above. The child process stays in the initial mount namespace while the main process performs all of the operations needed for setup. Eventually the helper process will be told to preserve the constructed mount namespace by bind mounting the /proc/${parent process pid}/ns/mnt file over an empty file in /run/snapd/ns/$SNAP_INSTANCE_NAME.mnt. This operation cannot be performed from the inside because of how the sharing of /run/snapd/ns is set up and because of the corresponding kernel limitations.

The rest of this section describes how the main process takes subsequent steps to prepare the mount namespace for snap applications.

The first step is to unshare the mount namespace. This gives us a separate copy of the mount table, initialised to the mount table we just shared with every other process in the initial mount namespace. From there on we can make modifications with mount and umount that will affect our table but not that of other processes, unless a mount event propagates to their namespace. Event propagation is an aspect unique to mount namespaces: while distinct mount namespaces indeed have separate mount tables, they may still be connected, because some of the mount entries in both of them share propagation events from one place to another. We use this feature heavily in snap-confine. As an example, when snapd asks systemd to mount a squashfs filesystem corresponding to a revision of some snap, systemd, running as pid-1, runs mount to perform the operation. Both of those processes (init and mount) will be in the initial mount namespace but the mount event will propagate to the mount namespace of every snap in the system, so the snap will appear mounted in all of them.
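In outline, the unshare step and the later preservation by the helper could look like this; the function names are hypothetical and "firefox" stands in for $SNAP_INSTANCE_NAME:

```c
#define _GNU_SOURCE
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Main process: get a private copy of the mount table. */
static void sc_unshare_mnt_ns(void) {
    if (unshare(CLONE_NEWNS) < 0) {
        /* die("cannot unshare mount namespace"); */
    }
}

/* Helper process, still in the initial namespace, once told to
 * capture the constructed namespace of the main process. */
static void sc_preserve_mnt_ns(pid_t main_pid) {
    char src[PATH_MAX];
    snprintf(src, sizeof src, "/proc/%d/ns/mnt", (int)main_pid);
    mount(src, "/run/snapd/ns/firefox.mnt", NULL, MS_BIND, NULL);
}
```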

Populate a mount namespace

With the fresh mount namespace in hand we can configure it by taking a careful sequence of steps. It is important to emphasise that we start with a pretty much arbitrary host configuration. We will transform it, ending up with a consistent mount namespace with a predictable location of everything required to run snap applications.

We do need to know a little bit about the host, so we begin by looking at the os-release file, either in /etc/os-release or in /usr/lib/os-release. We do this to know whether we are dealing with a classic system, like Debian, Fedora or SUSE, or a core system that is composed entirely of snaps. As you will see, we special-case the core16 distribution and use the so-called legacy mode for it. After core16 (that is, in core18 and beyond) and on all classic distributions the procedure is identical, but on core16, for backwards compatibility reasons, we don't perform some steps.

Normal bootstrap process

In the normal mode we begin by creating a temporary directory /tmp/snap.rootfs_{random string}, known as the scratch directory, where we will construct a filesystem hierarchy based on the base snap. The root filesystem (the initial one) is switched to recursive shared mount event propagation. This is meant to ensure consistency in case the host is running without systemd as pid-1 (e.g. Ubuntu 14.04) or when the host is a system container (e.g. Ubuntu 16.04 running inside LXD).

We then proceed to make the scratch directory an unbindable mount point. This contraption has a specific purpose: we really want to make sure that the mounted contents of the scratch directory are not replicated anywhere. This prevents mount loops from occurring (along with their disastrous consequences).

With the scratch directory prepared we bind mount the base snap over the scratch directory and change its sharing to slave mode. Slave sharing (forgive the name) means that events travel only in one direction, from the source to the peers but not back. This allows us to receive mount events without sending any back.
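Put together, the bootstrap described in the last three paragraphs looks roughly like this; base_path stands for the mounted current revision of the base snap (e.g. /snap/core18/1234), the function name is hypothetical and the ordering is simplified:

```c
#include <stdlib.h>
#include <sys/mount.h>

static char *sc_bootstrap_scratch(const char *base_path) {
    static char scratch[] = "/tmp/snap.rootfs_XXXXXX";
    if (mkdtemp(scratch) == NULL) {
        return NULL; /* die(...) in the real code */
    }
    /* Keep receiving mount events from the initial namespace. */
    mount("none", "/", NULL, MS_REC | MS_SHARED, NULL);
    /* Make the scratch directory a mount point that can never be
     * bind-mounted elsewhere, preventing mount loops. */
    mount(scratch, scratch, NULL, MS_BIND, NULL);
    mount("none", scratch, NULL, MS_UNBINDABLE, NULL);
    /* Put the base snap in place; slave propagation means events
     * flow host -> snap only. */
    mount(base_path, scratch, NULL, MS_BIND, NULL);
    mount("none", scratch, NULL, MS_SLAVE, NULL);
    return scratch;
}
```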

We then proceed to bind mount a set of directories still common with the initial mount namespace over subdirectories that exist in the base snap which now, also, inhabits the scratch directory. Those are usually recursively bind mounted from the source to the destination: recursively, because some of those directories contain other mount entries that we don't want to replicate manually. The /snap directory is the perfect example, containing numerous mounted revisions of numerous installed snaps. In addition to the recursive bind mount, most of the directories are also switched to slave event propagation. This is done so that, in the case of a mount event occurring inside a snap mount namespace, the event does not arrive in the initial mount namespace with unexpected consequences there. There are some exceptions to this: specifically, the /media and /run/netns directories are shared with the initial mount namespace, so anything mounted inside the snap mount namespace does show up on the host (and by extension, back in the mount namespaces of other snaps). This allows snaps that use network namespaces to be managed on the outside and disk management applications like udisks to mount removable media in a way accessible to other application processes. The complete list of directories is present in the source of snap-confine but we can highlight some here: /dev, /etc (with all the consequences), /home, /root, /proc, /sys, /tmp, /var/lib/snapd, /run and /mnt. Some others are mounted optionally, only if they exist in both the base snap and on the host at the same time.
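A sketch of what replicating one such directory can look like, with a hypothetical helper name:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/mount.h>

static void sc_bind_host_dir(const char *scratch, const char *dir,
                             bool bidirectional) {
    char dst[PATH_MAX];
    snprintf(dst, sizeof dst, "%s%s", scratch, dir);
    /* MS_REC carries nested mounts (e.g. everything under /snap)
     * along with the bind. */
    mount(dir, dst, NULL, MS_BIND | MS_REC, NULL);
    if (bidirectional) {
        /* /media and /run/netns stay shared with the host. */
        mount("none", dst, NULL, MS_REC | MS_SHARED, NULL);
    } else {
        /* Everything else only receives events from the host. */
        mount("none", dst, NULL, MS_REC | MS_SLAVE, NULL);
    }
}
```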

Since we do reuse the /etc directory from the host we must make some corrections to prevent host configuration from negatively impacting snap applications. We bind /etc/alternatives, /etc/ssl and /etc/nsswitch.conf back from the base snap to the scratch directory. This ensures that whatever is shipped in the base snap is really effective at runtime.

On core systems we also bind mount the directory containing the snap-confine executable (as inspected by looking at the target of the special symlink /proc/self/exe) over to the /usr/lib/snapd inside the scratch directory. This allows us to have a mostly empty base snap that still supports essential snapd tools that must exist inside each mount namespace.

We now bind mount the snap directory, either /snap or /var/lib/snapd/snap, over the /snap directory present in the base snap (still living in the scratch directory). This allows us to offer a consistent location inside the mount namespace, regardless of how the host is configured.

We are now very close to the essential step: the plunge inside. We prepare /var/lib/snapd/hostfs as a private mount point (again, by using the self-bind mount trick). Private propagation is once more about sharing: such mount points propagate events in neither direction. We do this to ensure that nothing we do on the inside or outside affects our view of the host filesystem.

We are almost done; the last step is to configure the user-space driver for Nvidia hardware. This is described in a separate section because of its complexity. See below for that.

Inevitably we arrive at the critical moment. We perform the super-chroot operation known as pivot_root. Specifically, we pivot_root into the scratch directory, storing the old root in /var/lib/snapd/hostfs. Unlike chroot, the operation instantly affects all the processes that share our mount table, and it effectively swaps the root directory with a given directory, putting the old root directory in yet another directory.

The operation means that the shared libraries, executables, data files and anything else present in the scratch directory now form the new root filesystem. This place is mostly ready for application execution. The dynamic linker is in the right spot, as are the shared libraries, runtimes and helper programs. There is still a tiny bit of adjustment to do, and we will also apply the mount profile specific to this snap, as described by snapd, but for the most part, this is it.

The last part of the cleanup is to detach the trio of special filesystems, /proc, /sys and /dev, whose copies we now see in /var/lib/snapd/hostfs/proc (and so on). We do this to avoid confusing snap applications with unexpected locations. We learned this the hard way.
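In outline, the pivot and the cleanup together look roughly like this; glibc provides no pivot_root wrapper, so the raw system call is used, and the function name is hypothetical:

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

static int pivot_root(const char *new_root, const char *put_old) {
    return (int)syscall(SYS_pivot_root, new_root, put_old);
}

static void sc_pivot_into_scratch(const char *scratch_dir) {
    chdir(scratch_dir);
    /* put_old is relative to the new root: the old root ends up
     * at /var/lib/snapd/hostfs. */
    pivot_root(".", "var/lib/snapd/hostfs");
    /* Lazily detach the stale copies of the special filesystems
     * now visible under the old root. */
    umount2("/var/lib/snapd/hostfs/proc", MNT_DETACH);
    umount2("/var/lib/snapd/hostfs/sys", MNT_DETACH);
    umount2("/var/lib/snapd/hostfs/dev", MNT_DETACH);
}
```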

Nvidia driver configuration

TBD, this is long and complex.

Differences in the legacy bootstrap process

The legacy process is a subset of the normal process. It is only used when the distribution is a core distribution comprised of only snap packages, using the core16 base snap as the boot snap, and when the snap we operate against also uses core16 as its base snap. This implies that the initial mount namespace is actually similar to what we want. We moved away from this model because the introduction of bases made things less consistent. In addition, the composition of the initial mount namespace on core devices is complex and we wanted to avoid potential issues arising from subtle differences between a snap running on a core16 versus a core18 system.

In the legacy mode we don’t use the base snap as the initial root filesystem, we use the real root filesystem, real /. We also don’t need to bind mount that many directories over (because we already have them), just /media and /run/netns, so that bi-directional sharing is applied.

Create private /tmp

To isolate snaps from the system we make an isolated, empty /tmp, unique for each snap. Instead of mounting a tmpfs, which could turn out to be too small on lower-end devices, we create a sub-directory of /tmp (remember that at this time we still share the real directory with the host) and bind mount that subdirectory over /tmp. Having done that we proceed to make /tmp private for mount event propagation so that neither the host can affect the snap nor vice versa.
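A sketch of the sequence; the directory name shown is illustrative and the function name is hypothetical:

```c
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/stat.h>

static void sc_setup_private_tmp(void) {
    char tmp_dir[] = "/tmp/snap.1000_firefox_XXXXXX"; /* illustrative */
    if (mkdtemp(tmp_dir) == NULL) {
        return; /* die(...) in the real code */
    }
    chmod(tmp_dir, 01777); /* behave like a regular world-writable /tmp */
    /* Bind the fresh sub-directory over /tmp, then stop event
     * propagation in both directions. */
    mount(tmp_dir, "/tmp", NULL, MS_BIND, NULL);
    mount("none", "/tmp", NULL, MS_PRIVATE, NULL);
}
```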

Interestingly, this allows one to inspect the per-snap /tmp by looking at the host's /tmp/snap.${numeric user ID}_${snap name}_${random string}.

Create private /dev/pts and /dev/ptmx

TBD

Apply the /var/lib/lxd quirk

TBD

Apply the mount profile

Snapd creates mount profiles by writing an fstab-like file to /var/lib/snapd/mount/$SNAP_INSTANCE_NAME.fstab. Unlike snap interfaces, snap mount profiles are not specific to a particular application: all applications in a given snap see the exact same mount namespace. Up until now the mount namespace was configured in pretty much exactly the same way for every snap operating on a given system. With the advent of content interface connections and snap layouts this has changed.
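For illustration only (the snap names and paths here are made up), an entry in such a profile reads like a regular fstab line, here bind-mounting a directory from a content snap into the application snap:

```
/snap/example-content/current/share /snap/example-app/current/share none bind,ro 0 0
```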

Mount profiles are, like other parts of the interface system, dynamic. A change in the state of a content interface connection is dynamically reflected in the mount namespace of the affected snap. This task is performed by snap-update-ns, a diff-and-patch tool specific to mount namespaces. The tool consumes two mount profiles, the one that is currently effective in a given mount namespace and the one that we wish to end up with, computes the difference in terms of mount operations to perform and then applies the "patch" by executing mount and unmount operations. The reality is a bit more complicated, but for the purpose of setting up a new mount namespace this description is sufficient.

When the current profile is empty (it is a new mount namespace after all) the diff is trivial: we simply perform the operations in the mount profile one by one. This is exactly the capacity in which snap-confine uses snap-update-ns. There is one small twist though: we open a file descriptor for snap-update-ns and keep it across the pivot_root call. Explicitly stated, we open the program we wish to execute, then change our mount namespace so that the original path we opened the program from may no longer exist, and then we execute the program using the file descriptor alone.
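A sketch of the trick, assuming fexecve(3), which resolves the descriptor through /proc/self/fd; the argument vector shown is illustrative, not the real snap-update-ns command line:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

extern char **environ;

static void sc_call_snap_update_ns(void) {
    /* Open before the namespace changes... */
    int fd = open("/usr/lib/snapd/snap-update-ns",
                  O_RDONLY | O_CLOEXEC);

    /* ... unshare, pivot_root, etc. happen here; the path above
     * may no longer resolve afterwards ... */

    /* Execute via the descriptor alone. In snap-confine this runs
     * in a forked child so the main process can carry on to
     * eventually run snap-exec. */
    char *argv[] = {"snap-update-ns", "example-snap", NULL};
    fexecve(fd, argv, environ);
}
```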

When invoked from snap-confine, snap-update-ns does not use setns itself and, as a result, changes the same mount namespace it was started in.

Apply per-user mount profile

TBD



Out of interest, is there any reason why we put the user ID in this directory name? The private tmp directory is shared by all invocations of apps for a snap, so it could easily end up containing temporary files owned by multiple users.

You’ve got the random data at the end, so it doesn’t really add uniqueness.

That’s one of the questions I wrote down as I was going through this. I have a patch to drop that part. I don’t think there is any reason to do that anymore especially since this is not unique to the user. It is shared by all the users of the snap.
