The following is a WIP, and also still somewhat “stream of consciousness” so apologies in advance… The primary purpose of this is to make sure I understand all the code properly and hopefully can be used to produce an up-to-date doc on mount namespaces created from snap-confine eventually.
Snaps mount setup
Strictly confined snaps currently have a non-trivial mount namespace setup. This post is meant to clarify the setup both as documentation and to clarify any misunderstandings I may personally have on the setup.
Initial snap run execution
Snap apps are currently executed via a symlink to /usr/bin/snap: the snap binary introspects from the symlink name that it is meant to be running a snap app, so it re-executes via snap run (with the symlink name passed along in the args) to determine which snap app (and thus which snap) is being executed.
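As a concrete illustration, here is a minimal sketch (not snapd's actual code; the /snap/bin wrapper layout and the "snap.app" naming convention are assumptions from a standard snapd install) of how a snap and app name can be recovered from the wrapper symlink's name:

```python
import os

def snap_app_from_argv0(argv0):
    """Given the argv[0] a snap wrapper was invoked as (e.g.
    /snap/bin/the-snap.the-app), recover the (snap, app) pair.
    A wrapper named just "the-snap" refers to the app with the
    same name as the snap."""
    name = os.path.basename(argv0)
    snap, sep, app = name.partition(".")
    return (snap, app if sep else snap)

# The wrapper itself is a symlink pointing at /usr/bin/snap, which is
# how the snap binary knows it was invoked as an app wrapper.
print(snap_app_from_argv0("/snap/bin/the-snap.the-app"))  # ('the-snap', 'the-app')
print(snap_app_from_argv0("/snap/bin/the-snap"))          # ('the-snap', 'the-snap')
```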
snap run then eventually executes snap-confine, with snap-exec as the binary that snap-confine will go on to run.
Side-note: the above setup will likely re-execute from /usr/bin/snap on the host filesystem into the core/snapd snap's version of snap, but AFAICT this does not trigger any confinement from running /snap/core/current/usr/bin/snap via AppArmor, etc.
Re-associate with pid 1 to access /run/snapd/ns
The first step in strict confinement is to ensure that we enter a known mount namespace containing the /run/snapd/ns folder, which is mounted by the snapd daemon in the initial mount namespace. To do this we open the mount namespace of PID 1 at /proc/1/ns/mnt and use that fd with setns(2) so that we can access /run/snapd/ns.
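The identity of a process's mount namespace can be inspected cheaply from /proc; a hedged Python sketch (reading the namespace identity rather than actually calling setns(2), which requires privileges):

```python
import os

def mnt_ns_id(pid="self"):
    """Return the mount namespace identity (e.g. 'mnt:[4026531840]')
    of a process, as exposed at /proc/<pid>/ns/mnt."""
    return os.readlink(f"/proc/{pid}/ns/mnt")

# snap-confine opens /proc/1/ns/mnt and passes the fd to setns(2) to
# join PID 1's mount namespace; reading /proc/1/ns/mnt requires
# privileges, so we only inspect our own process here.
print(mnt_ns_id())
```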
TODO: add classic confinement steps in separate post/section
Ensuring “/” or “/snap” are mounted rshared
After this, we do some work to ensure that "/" (or, failing that, "/snap") is mounted shared (MS_SHARED). This is because the mounts that happen underneath /snap, from /var/lib/snapd/snaps/the-snap.snap onto /snap/the-snap/the-revision, need to be shared so that re-using the mount namespace keeps working when a refresh of the snap happens.
Global mount namespace initialization and private mount check
Next, we ensure that the global mount namespace directory at /run/snapd/ns/ is mounted private so that mount events performed there do not propagate to any other peer groups. If /run/snapd/ns is not set up and is not currently a mount point, it is created as a recursive bind mount on top of itself and then remounted private. This operation requires a global lock across all snaps.
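The "is it already a mount point" test can likewise be answered from mountinfo; a sketch (the subsequent bind-mount-over-itself and MS_PRIVATE remount need privileges, so they appear only as comments):

```python
def is_mount_point(path):
    """True if `path` appears as a mount point in /proc/self/mountinfo."""
    with open("/proc/self/mountinfo") as f:
        return any(line.split()[4] == path for line in f)

# If /run/snapd/ns is not yet a mount point, the setup effectively does:
#   mount("/run/snapd/ns", "/run/snapd/ns", NULL, MS_BIND|MS_REC, NULL);
#   mount("none", "/run/snapd/ns", NULL, MS_PRIVATE, NULL);
# so that mount events there stay out of any peer group.
print(is_mount_point("/"))         # True
print(is_mount_point("/no-such"))  # False
```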
Per-snap mount namespace initialization + rootfs check
Then begins the per-snap mount namespace initialization. The first check ensures that the rootfs for the snap (i.e. the base snap) exists, allowing fallbacks from core to ubuntu-core and from core16 to core.
Device cgroup enablement
Next, if any interfaces plugged by this app use the udev/device cgroup backend, a device cgroup is set up for the snap and enforced. Whether to set up the device cgroup is determined by whether any udev rules are defined for the snap.
Mount namespace file existence
Next, we check whether there is a saved, up-to-date mount namespace that we can join directly, or whether we need to recreate the mount namespace. This is handled by searching for a preserved mount namespace file for the snap under /run/snapd/ns/.
New mount namespace creation
If the mount namespace isn't up to date, then we set up a new one by first unsharing only the mount namespace with unshare(CLONE_NEWNS), and making the "/" mount point recursively shared so that some mounts we create and set up from inside the per-snap mount namespace can be shared back with the host filesystem.
Unbindable rootfs scratch mount
We then bind mount a scratch directory, created for the new rootfs, on top of itself so that it can be remounted as unbindable. This is simply to avoid recursive bind mount loops: the scratch directory is a subdirectory of the rootfs that we will be bind mounting on top of it.
Rootfs bind mount
Then we recursively bind mount the rootfs of the snap (which can be "/", "/snap/core/", "/snap/ubuntu-core", etc.) onto the scratch directory, so that we receive all of the subdirectory mounts that may exist under the rootfs. We immediately make that mount point a recursive slave so that none of the subsequent mounts we perform in the rootfs viewed at our scratch directory propagate back to the original rootfs mount.
Host filesystem bind mounts
Then we proceed to bind mount a number of specified directories from the host filesystem, depending on the mode we are executing in. For the normal mode (UC18+ and classic distros) we mount the following from the host onto the rootfs we created at the scratch directory:
Note that some of the above directories are marked as optional, meaning that if the mount fails it is okay to continue running. For the non-optional directories the mount is performed as a recursive bind mount; if the mount is meant to be unidirectional, the mount point is then turned into a recursive slave mount.
Legacy host filesystem bind mounts
Specifically on Ubuntu Core 16 with a "core" base snap, the following directories are mounted from the host:
This list is shorter because, for the most part, the rootfs from the initial mount namespace on Ubuntu Core 16 is already the expected base snap. If we are on Ubuntu Core 16 and the snap has core18 as its base, then the previous "normal" mode of mounting is used.
Rootfs bind mount on top of hostfs
Then, if we are running in the normal mode, we need to remount a few directories from the desired rootfs of the base snap on top of the host filesystem directories we just bind mounted into our scratch directory, with a bind mount followed by a slave mount. These are:
Handle snapd snap mounting
Next, we check whether we need to mount snapd and the snapd tools from the snapd snap, for the case where the base (e.g. core18) does not contain snapd itself. This is done with a bind mount from the snapd snap, which is then made private.
Next, we mount the directory containing all snap files from the host filesystem into the snap with a recursive bind mount, followed by a recursive slave mount.
Pivot_root preparation with put_old
Next, we prepare a put_old directory where the old rootfs will be mounted when we perform a pivot_root. The current location of the put_old directory is var/lib/snapd/hostfs underneath the scratch directory. An undocumented requirement of put_old is that it must be mounted private and must not propagate any mount events anywhere, so we bind mount that directory on top of itself (to ensure it is a mount point) and then mount it private.
Nvidia stuff on classic
Pivot_root setup with hostfs
Then we perform a pivot_root into the scratch directory (so that it shows up as “/”) with the old rootfs at put_old.
After this, we unmount the self-bind mount on the scratch directory in the old rootfs, which is now at /var/lib/snapd/hostfs, so that we can remove the scratch directory and effectively clean it up.
Next, we make the old root filesystem a recursive slave mount so that we cannot modify the original host filesystem from that mount point.
We also unmount /var/lib/snapd/hostfs/dev, /var/lib/snapd/hostfs/sys and /var/lib/snapd/hostfs/proc, because these filesystems existing in two places in the same rootfs can confuse applications like docker, which do not use the filesystem directly but instead inspect the mounts themselves via something like mountinfo.
Next, the per-snap, somewhat persistent /tmp is set up. This is really a subdirectory of /tmp in the initial mount namespace of the host's rootfs, such as /tmp/snap.the-snap: we create /tmp/snap.the-snap/tmp, bind mount it on top of /tmp inside the snap's mount namespace, and then make it private.
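The layout can be sketched as pure path construction (the mount(2) steps, which need privileges, appear only as comments, and the exact directory modes are assumptions, not taken from snap-confine):

```python
import os

def per_snap_tmp(snap_name):
    """Return (outer, inner) for the per-snap /tmp layout: a private
    directory under the host's /tmp, and the subdirectory that gets
    bind mounted over /tmp inside the snap's mount namespace."""
    outer = os.path.join("/tmp", f"snap.{snap_name}")
    inner = os.path.join(outer, "tmp")
    # Roughly: mkdir(outer); mkdir(inner, sticky+world-writable);
    #          mount(inner, "/tmp", NULL, MS_BIND, NULL);
    #          mount("none", "/tmp", NULL, MS_PRIVATE, NULL);
    return outer, inner

print(per_snap_tmp("the-snap"))  # ('/tmp/snap.the-snap', '/tmp/snap.the-snap/tmp')
```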
Next, the per-snap instance of the pts subsystem is set up by mounting a new devpts instance with the newinstance option and ptmxmode=0666, since the host rootfs instance is mounted single-instance with ptmxmode=0000. Then /dev/pts/ptmx is bind mounted on top of /dev/ptmx.
Then, we call snap-update-ns to set up the bind mounts that are enforced by the specific security backends the app uses by plugging certain interfaces.
Save mount namespace
Per-user mount namespace setup