Fixing live propagation of mount changes

Hey everyone. I’m working towards having a new internal tool/command for snapd. The tool is snap-update-ns and is designed to jump into an existing mount namespce of a given snap and change it so that it reflects interface connections. This is something that snapd would use automatically when e.g. content interface is connected or disconnected. If you run a program snapd keeps a mount namespace around (the set of mounted things) that is separate from the rest of the system. Once that namespace is created and you connected an interface nothing (currently) happens. Since this is a very common and annoing behavior we would like this to “just work” like magic, without any hand-holding from the user.

I have a few PRs for this:

Apart from this I already landed the main algorithm that l loads two files and computes the mount / unmount actions needed to convert the “current” world into the “desired” world. This function can be found here:

1 Like

For some background on this change, today when a snap is already running and then something else needs to perform changes in its mount namespace (say, a content interface was connected), these changes are not propagated. Instead, we need to forcefully unmount /run/snapd/ns/<snap name>.mnt and then remove that file, so that new processes will pick up the new state. Even that is problematic, though, because it creates a split world, where previously running processes will be on top of a different filesystem from new processes or restarted daemons.

So, that’s why @zyga-snapd is working on the snap-update-ns tool, which will allow us to perform live changes on the running filesystem, so every process of the same snap continues to see the same independent filesystem as usual today, with modifications to that being applied on demand.

I need a review for a trivial function for parsing whole fstab files:

I also have the writer that is even easier to review:

A few late comments on #3102.

Thank you, here’s a PR with those addressed:

So with some more progress over the weekend I have the following:

  • I can now do almost everything required to make actual mount changes work
  • I need to parse mountinfo data to know when something has already taken place (e.g. the desired change is has already happened)
  • I need to track what changes were made and save the current profile

There are some misc functions I had to create, I will put most of them in the snap-update-ns command itself but some will flock to interfaces/mount. The first one is mount.OptsToFlags that converts string mount options to integer flags for syscall.Mount. There are also some miscellaneous functions to lock/unlock the lock associated with each namespace. I’ll be proposing them in their individual PRs:

For now I have one new PR that I’d like to just land quickly as it introduces a data type:

@zyga-snapd Reviewed #3114… looks pleasantly simple, but I think atomicity is a must there.

Ah, indeed. I’ll make a quick follow up ASAP.

One more PR for the effort: mount option to mount flag converter:

Actually, I realized that since we use a writer this is not the right level. I’ll use the atomic helper around that to ensure it’s all good though.

@zyga-snapd That doesn’t sound great. The function itself should be responsible for writing to the filesystem atomically, instead of expecting everyone that calls it to be enforcing atomicity.

In general, it’s also a bad idea to merge branches that have controversial points being discussed. When someone sends a review saying “Request Changes”, it means the points need to be fixed or discussed with the author of the comment, otherwise reviews become a bit pointless.

If that happens frequently, we’ll need to go back to the approach we used for a short while, where PRs with such review statuses are blocked until the original reviewer changes the status. I’d prefer to remain using the current practice where we just trust the author to respect comments made.

The current pair of fuctions take io.{Reader,Writer}. I wrote a pair that just saves/loads to a file by name. This was easier to test and is more flexible as well. The function that writes to file by name does use atomic helpers now.

1 Like

I have some more progress:

The last one is just a stub for figuring out naming and will be used by the subsequent algorithm that checks if a mount change is needed given mountinfo data.

I really really need a review for - I won’t be able to really do anything for real before that lands.

Hey, I wanted to update you on where we are. In PR3216 you can read the general high-level algorithm. The general idea is as follows:

  • Read the current and desired mount profiles. Those are fstab-like files that are written by snapd (desired profile) and snap-confine and snap-update-ns (current profile)
  • Diff the profiles so that we know what we should mount or unmount
  • For each such mount change, see if we need to perform it in practice (aka Change.Needed function). This is because we may be upgrading from an older snap-confine that doesn’t write the current profile or because the snap may have mounted something by itself.
  • Collect all the applied changes (we don’t stop if an error occurs) and write the current profile for next time we run snap-update-ns.

I got stuck on Change.Needed as it is unfortunately non-trivial to compute the answer. The problem is what we are working with as in data representation given by the kernel. This is described in a mountinfo file which is documented here.

Mountinfo contains a rather raw representation of the kernel mount table. Each mount has an MountID and a ParentID, a MountDir that described where something is mounted, a MountSource and FsType that describe what is mounted and, curiously, Root which describes the subtree of the MountSource that is mounted, this is essential as we will see shortly.

This information is relatively simple to process for regular mounts (so not bind mounts and not things like tmpfs that are not associated with a block device). A super simple version of Change.Needed that works with this data is present in PR3209. Unfortunately snapd relies on bind mounts heavily so we need something better or more general.

The first problem with bind mounts is that the fstab-like file that describes our intent doesn’t contain absolute information. The source mount is “resolved” at the time when the mount is performed. To illustrate this contrast two fstab-like entries, one which describes a regular mount and another that describes a bind mount:

/dev/sda2 / ext4 errors=remount-ro 0 1
/snap/ubuntu-app-platform/34 /snap/lonewolf/3/ubuntu-app-platform none bind,ro 0 0

In the first case we know that we’re mounting /dev/sda2. In the second case we have no idea what we are mounting, we can only analyse the mount table and deduce what is mounted on /snap/ubuntu-app-platform/32. What is worse is that we can bind mount something that is not itself a mount point. To illustrate this imagine that the last line read like this instead:

/snap/ubuntu-app-platform/34/stuff /snap/lonewolf/3/ubuntu-app-platform none bind,ro 0 0

Now stuff is being bind-mounted to /snap/lonewolf/3/ubuntu-app-platform but stuff is a directory under (presumably) some revision of the ubuntu-app-platform squashfs.

Now let’s examine a similar (but different because I already have the test data) case. Let’s start with the shell commands that I performed:

$ mkdir data
$ sudo mount -t tmpfs none data
$ cd data                                                                                                                     
$ mkdir foo bar
$ sudo mount --bind foo bar                                                                                                                               
$ mkdir froz
$ mount --bind bar froz                                                                                                                                 

The mountinfo table says this (I left out irrelevant parts):

25 0 8:2 / / rw,relatime shared:1 - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
392 25 0:47 / /home/zyga/data rw,relatime shared:158 - tmpfs none rw
400 392 0:47 /foo /home/zyga/data/bar rw,relatime shared:158 - tmpfs none rw
408 392 0:47 /foo /home/zyga/data/froz rw,relatime shared:158 - tmpfs none rw

The question we would now like to answer is: Has the mount --bind /home/zyga/data/foo /home/zyga/data/froz already occured?

My initial take on this was that we want to understand what /home/zyga/data/foo really is and then look for anything else that is the same thing. Here we can analyse the mountinfo table and come up with an answer that /home/zyga/data/foo is a fragment /foo of the tmpfs mounted ID 392. Earlier I used MountSource (.e.g /dev/sda2) but it is useless for tmpfs/proc and other virtual filesystems so I abandoned that and switched to MountID.

So we can rephrase the question as such: is “/foo”@MountID:492 present at /home/zyga/data/froz ?

Here the complexity and my uncertainty lies in the fact that the relevant mountinfo entry feels somewhat ambiguous:

408 392 0:47 /foo /home/zyga/data/froz rw,relatime shared:158 - tmpfs none rw

So we know that /home/zyga/data/froz is /foo from some tmpfs but we have no way to say if that is really the one with MountID 392. It seems like the ParentID is the key here but I don’t know if a simple match against it is sufficient. I tried many different examples yesterday and each time I got to a “oh, let’s use this” I could create another example where that wasn’t sufficient.

I wanted to share the problem statement so that code review can be done more meaningfully and everyone can share their ideas.

As a small update. I wanted to explain why the shared:xxx or similar optional fields are not the solution. They indicate mount event propagation among peer groups. We can easily construct a mount table that doesn’t agree on the value of the optional fields yet contains the exact same bind mount. What happens is that mount events may or may not be propagated but the contents of the actual filesystems is exactly the same *

  • as long as you keep exploring one filesystem

Update: this issue is now resolved. @pedronis wisely suggested to use the st_dev field that identifies the device that is mounted. I’m now working only on the problem of one mount shadowing entries from another but I definitely got unstuck.

If you are interested in following the development look at latest commits in this branch

1 Like