Mount namespace walkthrough [WIP]

The following is a WIP, and also still somewhat “stream of consciousness” so apologies in advance… The primary purpose of this is to make sure I understand all the code properly and hopefully can be used to produce an up-to-date doc on mount namespaces created from snap-confine eventually.

Snaps mount setup

Strictly confined snaps currently have a non-trivial mount namespace setup. This post is meant both to document that setup and to clear up any misunderstandings I may personally have about it.

Initial snap run execution

Snap apps are currently executed by creating a symlink /snap/bin/the-snap.the-app -> /usr/bin/snap. The snap binary can tell from the symlink name (passed in args[0]) that it is meant to be running a snap app, and which snap app (and thus which snap) is being executed, so it re-executes via snap run. snap run then eventually executes snap-confine, with snap-exec as the binary that snap-confine will run.
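
For illustration, the argv[0]-based dispatch can be sketched like this; the helper name is mine, the real logic lives inside the snap command itself:

```c
/* Sketch: deriving the snap app name from argv[0] when invoked through
 * the /snap/bin symlink. The helper name is illustrative. */
#include <assert.h>
#include <string.h>

/* "/snap/bin/the-snap.the-app" -> "the-snap.the-app" */
static const char *snap_app_from_argv0(const char *argv0) {
    const char *slash = strrchr(argv0, '/');
    return slash != NULL ? slash + 1 : argv0;
}
```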

Side-note: the above setup will likely re-execute from /usr/bin/snap on the host filesystem into the core/snapd snap’s copy of the snap binary, but AFAICT this does not trigger any confinement from running /snap/core/current/usr/bin/snap via AppArmor, etc.

Re-associate with pid 1 to access /run/snapd/ns

The first step in strict confinement is to enter a known mount namespace that contains the /run/snapd/ns folder, which is mounted by the snapd daemon in the initial mount namespace. To do this we open the mount namespace of PID 1 at /proc/1/ns/mnt and use that fd with setns(2), after which we are able to read /run/snapd/ns.
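
A minimal sketch of this re-association in C (setns(2) needs privilege, which snap-confine has; the helper names are mine, not snap-confine's):

```c
/* Sketch: re-associating with PID 1's mount namespace. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* e.g. pid 1 -> "/proc/1/ns/mnt" */
static int format_mnt_ns_path(int pid, char *buf, size_t len) {
    int n = snprintf(buf, len, "/proc/%d/ns/mnt", pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Join pid 1's mount namespace; fails without CAP_SYS_ADMIN. */
static int join_initial_mnt_ns(void) {
    char path[64];
    if (format_mnt_ns_path(1, path, sizeof path) != 0)
        return -1;
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    int err = setns(fd, CLONE_NEWNS);
    close(fd);
    return err;
}
```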

TODO: add classic confinement steps in separate post/section

Ensuring “/” or “/snap” are mounted rshared

After this, we do some work to ensure that “/” (or, failing that, “/snap”) is mounted with shared propagation (MS_SHARED). This is because the mounts that happen underneath /snap, from /var/lib/snapd/snaps/the-snap.snap to /snap/the-snap/the-revision, need to be mounted shared so that re-using the mount namespace works when a refresh of the snap happens.
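
Whether a mount point already has shared propagation can be read from /proc/self/mountinfo (the optional "shared:N" field, see proc(5)); making it shared is then a mount(2) call with MS_REC | MS_SHARED. A sketch, with deliberately simplified parsing and helper names of my own:

```c
/* Sketch: checking for shared propagation via /proc/self/mountinfo.
 * Making a mount shared afterwards would be:
 *   mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL);  (needs privilege) */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* 1 if this mountinfo line carries a shared:N propagation tag. */
static int line_is_shared(const char *line) {
    return strstr(line, " shared:") != NULL;
}

/* 1 if mount_point is listed as shared, 0 if not, -1 on error. */
static int mount_is_shared(const char *mount_point) {
    FILE *f = fopen("/proc/self/mountinfo", "r");
    if (f == NULL)
        return -1;
    char line[4096], mp[4096];
    int shared = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Field 5 of a mountinfo line is the mount point. */
        if (sscanf(line, "%*s %*s %*s %*s %4095s", mp) == 1 &&
            strcmp(mp, mount_point) == 0 && line_is_shared(line)) {
            shared = 1;
            break;
        }
    }
    fclose(f);
    return shared;
}
```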

Global mount namespace initialization and private mount check

Next, we ensure that the global mount namespace directory at /run/snapd/ns/ is mounted privately so that mount events performed there do not propagate to any other peer groups. If /run/snapd/ns is not set up and not currently a mount point, then it is created as a recursive bind mount on top of itself, and then remounted private. This operation requires a global lock across all snaps.

Per-snap mount namespace initialization + rootfs check

Then begins the per-snap mount namespace initialization. The first check performed here ensures that the rootfs for the snap (i.e. the base snap) exists, allowing a fallback from core to ubuntu-core, and from core16 to core.

Device cgroup enablement

Next, if any interfaces plugged by this app use the udev/device cgroup backend, a device cgroup is set up for the snap and enforced. Whether or not to set up the device cgroup is determined by whether any udev rules are defined for the snap.

Mount namespace file existence

Next, we check to see if there is a saved and current mount namespace that we can just join directly, or if we need to recreate the mount namespace. This is handled by searching for /run/snapd/ns/the-snap.mnt.

New mount namespace creation

If the mount namespace isn’t up to date, then we set up a new one by first unsharing only the mount namespace with unshare(CLONE_NEWNS), and then making the “/” mount point recursively shared so that some mounts we create and set up from inside the per-snap mount namespace can be shared back with the host filesystem.
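
Those two steps can be sketched as follows (both calls require privilege; wrapper names are mine):

```c
/* Sketch: create a new mount namespace and make "/" recursively shared. */
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>

/* Equivalent of: unshare --mount */
static int new_mount_ns(void) {
    return unshare(CLONE_NEWNS);
}

/* Equivalent of: mount --make-rshared <dir> */
static int make_rshared(const char *dir) {
    return mount(NULL, dir, NULL, MS_REC | MS_SHARED, NULL);
}
```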

Unbindable rootfs scratch mount

We then bind mount a scratch directory we created for the new rootfs on top of itself so that it can be remounted as unbindable. This is simply to avoid recursive bind mount loops, since this scratch directory is a subdirectory of the rootfs we will be bind mounting on top of it.

Rootfs bind mount

Then we recursively bind mount the rootfs of the snap (which can be “/” or “/snap/core/” or “/snap/ubuntu-core”, etc.) onto the scratch directory, so that we receive all of the subdirectory mounts that may exist in the rootfs. We immediately make that mount point a recursive slave so that none of the subsequent mounts we perform in the new rootfs propagate back to the original rootfs mount we are viewing at our scratch directory.

Host filesystem bind mounts

Then we proceed to bind mount a number of specified directories from the host filesystem, depending on the mode we are executing in. For the normal mode (UC18 and classic distros) we mount the following from the host onto the rootfs we created at the scratch directory:

  • /dev
  • /etc
  • /home
  • /root
  • /proc
  • /sys
  • /tmp
  • /var/snap
  • /var/lib/snapd
  • /var/tmp
  • /run
  • /lib/modules
  • /usr/src
  • /var/log
  • /media
  • /run/netns
  • /mnt
  • /var/lib/extrausers

Note that some of the above directories are marked as optional, meaning that if the mount fails it’s okay to continue running. For the non-optional directories the mount is performed with a recursive bind mount; if the mount is meant to be unidirectional, the mount point is then made a recursive slave.
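
Such a list is naturally expressed as a table of entries with a per-entry optional flag. This is only a sketch: the struct name is mine, and which entries are actually optional is decided in the snap-confine source, so the flags below are placeholders:

```c
/* Sketch: modelling the host directory list above as a table. The
 * optional flags are illustrative placeholders, not snap-confine's
 * real configuration. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct host_mount {
    const char *path; /* host directory bind mounted into the rootfs */
    bool optional;    /* if true, a failed mount is not fatal */
};

static const struct host_mount normal_mode_mounts[] = {
    { "/dev", false },           { "/etc", false },
    { "/home", false },          { "/root", false },
    { "/proc", false },          { "/sys", false },
    { "/tmp", false },           { "/var/snap", false },
    { "/var/lib/snapd", false }, { "/var/tmp", false },
    { "/run", false },           { "/lib/modules", true },
    { "/usr/src", true },        { "/var/log", true },
    { "/media", true },          { "/run/netns", false },
    { "/mnt", true },            { "/var/lib/extrausers", true },
};
```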

Legacy host filesystem bind mounts

On specifically ubuntu core16 with a “core” base snap, the following directories are mounted from the host:

  • /media
  • /run/netns

This list is shortened because, for the most part, the rootfs from the initial mount namespace for Ubuntu Core 16 is already set up to be from the expected base snap. If we are on Ubuntu Core 16 and the snap has core18 as its base, then it will use the previous “normal” mode of mounting.

Rootfs bind mount on top of hostfs

Then, if we are running in the normal mode, we need to remount a few directories from the desired rootfs of the base snap on top of the host filesystem directories we just bind mounted into our scratch directory, again with a bind mount followed by making it a slave mount. These are:

  • /etc/ssl
  • /etc/alternatives
  • /etc/nsswitch.conf

Handle snapd snap mounting

Next, we check whether we need to mount snapd and the snapd tools from the snapd snap, for the case where we have a base like core18 in which snapd lives outside of the base snap. This is done with a bind mount from the snapd snap, which is then made a private mount.

/snap mounting

Next, we mount the directory containing all snap files from the host filesystem into the snap with a recursive bind mount, then a recursive slave mount.

Pivot_root preparation with put_old

Next, we prepare a put_old directory where the old rootfs will be mounted when we perform a pivot_root. The current location of the put_old directory is var/lib/snapd/hostfs underneath the scratch directory. An undocumented requirement of put_old is that it must be mounted private and not propagate any mount events anywhere, so we bind mount that directory on top of itself (to ensure it is a mount point) and then mount it private.
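
Since glibc exposes no pivot_root(2) wrapper, the call goes through syscall(2); a sketch (requires privilege, wrapper name mine):

```c
/* Sketch: invoking pivot_root(2) via syscall(2). put_old must live
 * underneath new_root and must not have shared propagation, which is
 * why the directory is made private beforehand. */
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

static int pivot_into(const char *new_root, const char *put_old) {
    return (int)syscall(SYS_pivot_root, new_root, put_old);
}
```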

Nvidia stuff on classic


Pivot_root setup with hostfs

Then we perform a pivot_root into the scratch directory (so that it shows up as “/”) with the old rootfs at put_old.

After this, we unmount the self-bind mount of the scratch directory on the old rootfs, which is now at /var/lib/snapd/hostfs, so that we can remove the scratch directory and effectively clean it up.

Next, we mount the old root filesystem as a recursive slave so that we can’t modify the original host filesystem from that mount point.

We also unmount /var/lib/snapd/hostfs/dev, /var/lib/snapd/hostfs/sys, and /var/lib/snapd/hostfs/proc, because having all of these filesystems exist in two places in the same rootfs can confuse applications like docker, which do not look at the filesystem contents but instead inspect the mounts themselves via something like mountinfo.

Per-snap /tmp

Next, the per-snap, somewhat persistent /tmp is set up. This is really a subdirectory of /tmp in the initial mount namespace in the rootfs of the host, such as /tmp/snap.the-snap; we then create /tmp/snap.the-snap/tmp and mount that as /tmp inside the snap’s mount namespace with a bind mount on top of /tmp, then make it private.
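
The per-snap tmp path construction can be sketched as follows (helper name is mine):

```c
/* Sketch: building the per-snap private tmp path described above. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* e.g. "the-snap" -> "/tmp/snap.the-snap/tmp" */
static int snap_private_tmp(const char *snap_name, char *buf, size_t len) {
    int n = snprintf(buf, len, "/tmp/snap.%s/tmp", snap_name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```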

Per-snap pts

Next, the per-snap instance of the pts subsystem is set up by mounting a new devpts instance. It is made multi-instance by mounting with the newinstance option and ptmxmode=0666, since the host rootfs instance is mounted single-instance and has ptmxmode=0000. Then /dev/pts/ptmx is bind mounted on top of /dev/ptmx.

Calling snap-update-ns

Then, we call snap-update-ns to setup the bind mounts that are enforced by specific security backends that the app uses by plugging certain interfaces.


Save mount namespace


Per-user mount namespace setup


Content interface




Notably this is not a recursive bind mount. Perhaps this is on purpose but it’s not clear to me why it shouldn’t be a recursive bind mount. See

In more detail, connecting an interface makes snapd add udev rules tagging devices corresponding to that interface for a given snap. During setup, snap-confine queries udev for devices matching a given snap tag. The device cgroup is set up only if any devices were matched. Otherwise there’s no device cgroup at all, which in turn may be a bit unexpected.

This is not super clear, but IIRC on UC 16 we use the host root filesystem, and not the core snap (though the former comes from the snap), as the rootfs for a snap with a core base, and there is no pivot_root; this is what the comments in the code call “legacy” mode.

Right, that’s what I understood from the code: the host rootfs comes from the core snap on Ubuntu Core 16, so when the base of a snap is core, no pivot_root is necessary, as we are already using that as the rootfs when snap-confine runs.

I was wondering what the implications would be if we dropped the legacy behavior and instead always used pivot_root in “normal” mode. It would require some changes and would need significant testing to ensure no regressions or bugs, but it could simplify the codebase in snap-confine and make it easier to make future changes without having to take into account the special cased behavior of Ubuntu Core 16.