Bug? Saves to $SNAP_USER_DATA are blocked if a snap updates while it is already running

If a snap is being run by a user and it automatically updates in the background, any attempt by the user to save to $SNAP_USER_DATA will now be blocked by AppArmor. This could lead to data loss if, for example, the user has been writing an essay in a word-processor snap, or a game saves on exit.

I have made a test snap at https://github.com/mcphail/updateblockssave. To test it, open up two terminals.

TERMINAL1:
$ snap install --dangerous updateblockssave_0.1_amd64.snap

TERMINAL2:
$ updateblockssave
Press ENTER. Two files should be written to ~/snap/updateblockssave/current/, indicating success. Then get rid of them:
$ rm ~/snap/updateblockssave/current/*

TERMINAL2 again:
$ updateblockssave
This time, don’t press ENTER at the prompt. Instead switch to
TERMINAL1:
$ snap install --dangerous updateblockssave_0.1_amd64.snap
…then switch back to TERMINAL2 and press ENTER. This time the files should fail to save and there will be AppArmor denials.
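To spot the denials quickly, something like the following should work (a rough sketch; it assumes the AppArmor denials end up in the kernel log on your system, as they normally do):

$ sudo dmesg | grep 'apparmor="DENIED"' | grep updateblockssave

(journalctl -k should show the same messages on systems using the journal.)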

Is this a bug or intended behaviour? I feel it is a bug, and a very dangerous one for user data.

2 Likes

It’s really both: it is a design decision not to allow writes to another revision’s data, since the data format may have changed and we don’t want to break reverts. The bug is that the user experience is terrible. Many resort to using SNAP_USER_COMMON to work around this. This is https://bugs.launchpad.net/snapd/+bug/1616650

1 Like

Thanks @jdstrand, that does seem to be the issue. Of note, the snap is not trying to write to a different revision’s directory: AppArmor stops it writing to its own revision’s directory. I think this would be solved if a snap refresh were held in a pending state until all instances of the snap are closed. Is that technically feasible? Otherwise I can’t see a way around this without breaking versioned saves.

Another thing I’ll need to test when I get home is whether AppArmor allows writes to $SNAP_USER_COMMON if the snap has been refreshed in the background, or if this gets blocked as well.

AppArmor won’t block SNAP_USER_COMMON across a refresh.

For SNAP_DATA and SNAP_USER_DATA, what is happening is that the allowed revision is changed during runtime. E.g., snap foo at revision 1 is started and so it has write access to ~/snap/foo/1. While it is running, the snap is refreshed to revision 2 and snapd updates the AppArmor policy to give it write access to ~/snap/foo/2 and read-only access to ~/snap/foo/1. The running snap has no idea this happened: everything in its environment still says ~/snap/foo/1 is ok and any open file descriptors still point into ~/snap/foo/1. See my comment in the bug for something snapd could do: https://bugs.launchpad.net/snapd/+bug/1616650/comments/5 (which is what you suggested).
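To make that concrete, here is a rough illustration with a hypothetical snap called foo (not a real transcript; the paths and revision numbers are just examples):

$ snap run foo              # started from revision 1; inside the app, SNAP_USER_DATA points at ~/snap/foo/1
$ snap refresh foo          # in another terminal, while the app is still running
$ readlink ~/snap/foo/current
2

The still-running app keeps SNAP_USER_DATA=~/snap/foo/1 and its open file descriptors, but the regenerated policy now only allows writes under ~/snap/foo/2, so the app’s own saves are denied.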

1 Like

Aah. Thanks. My New Year’s resolution will be to read bug reports all the way to the bottom :wink:

Is there any ETA for a fix for this yet? The current implementation breaks versioned saves for every application and game and is a major failing in the snap infrastructure. I saw it was on the agenda for the recent sprint but it looks as if other things took priority. Does anyone feel like diving in to give this issue a bit of love?

@mcphail we’re aware of the problem, @niemeyer thought of a workaround, then with @jdstrand and @zyga-snapd we thought of a better one, and it’s on @zyga-snapd’s plate to work on ASAP.

For the record, the plan is, as I recall it, that when a snap refresh is attempted and there is a process from that snap running, the refresh is postponed (within reason). At the same time we’d make SNAP_DATA and SNAP_USER_DATA point to the current symlink instead of to the revisioned directory.
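Roughly, the environment change would look like this (a sketch only, using a hypothetical snap foo currently at revision 2):

$ echo $SNAP_USER_DATA            # today: the revisioned directory
/home/user/snap/foo/2
$ echo $SNAP_USER_DATA            # with the proposed change: the current symlink
/home/user/snap/foo/current
$ readlink ~/snap/foo/current     # flipped by snapd when the refresh completes
2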

We might look into alerting the user of this if possible, but I don’t think that’s part of the first iteration of the fix.

2 Likes

Yes, essentially the idea is to do the actual refresh when it’s safe to do so. I will get to technical details when I start working on this issue.

1 Like

Cheers guys. I appreciate your hard work.

@zyga-snapd @chipaca The last plan we discussed was still to do the changes we talked about to make refreshes a non-issue. I’m keen on doing that instead of fiddling with refresh times, which will bring more complexity and other issues without actually solving the main problem. Not refreshing is not a way to refresh safely.

I didn’t look at my notes or at the discussions we had before so I may have missed that. I will definitely review everything before working on this.

Can the full plan be put into this topic? I suggested something else to @chipaca due to complications with environ not being updatable, as well as issues with open file descriptors.

After checking with @zyga-snapd, I think Gustavo is indeed forgetting the hangout we had going over our conversation with you, jj, and zyga, in which the task was moved from my plate to zyga’s.

But zyga’ll update with more detailed notes I’m sure.

Hey @zyga-snapd, any update?

I haven’t started working on this yet. I will summarize my thoughts and put them into writing here soon, though, as I could start working on this in parallel with other tasks.

  • goal: refresh when no process of the snap is running
  • is refresh-mode or anything else affecting this?
  • if a process runs for too long, what then?
    • kill it?
    • refresh the old way?
    • run a hook and ask it to quit?
  • we can refresh on early boot if the snap is cached
  • technical aspect: refreshing when the freezer cgroup is empty (rough sketch after this list)
    • cgroup v2 has a native notification feature, but we cannot use it
    • cgroup v1 requires filesystem monitoring or polling
    • notification is enabled/disabled only ahead of a refresh inside snapd, so it should not be costly in practice
  • technical aspect: unclear how to behave when refresh is forbidden
    • refresh is forbidden while the new revision is not ready
    • downloading is not a problem, it can happen beforehand
    • the main problem seems to be copying data, which can take arbitrarily long
    • some parallels with system-key behaviour
    • we can just block and wait until ready
    • we need a new mechanism; locks time out after 3 seconds
  • the snapd process needs to monitor cgroup vacancy
    • it must acquire the lock, recheck, and write the refresh-in-progress marker
    • the refresh-in-progress marker is removed when the new snap revision is active
  • idea: parts of the refresh check whether it is safe to proceed, and retry if not
  • idea: if an event comes in saying we can refresh now, we trigger the refresh
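A rough shell sketch of the cgroup v1 mechanics above, using a hypothetical snap named foo (illustrative only, not what snapd would do verbatim):

# polling: cgroup.procs lists one PID per line, so empty output means no process is left
if [ -z "$(cat /sys/fs/cgroup/freezer/snap.foo/cgroup.procs 2>/dev/null)" ]; then
    echo "freezer cgroup is empty, safe to refresh"
fi

# notification: enable the empty-cgroup notification only just ahead of a refresh
echo 1 > /sys/fs/cgroup/freezer/snap.foo/notify_on_release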

Thanks for finding time for this, @zyga-snapd.

  • goal: agreed
  • I don’t think the system should decide how long is “too long” for a process to run. Any tool should be able to be packaged as a snap, and snapd should not interfere with the running of those tools. Nagging a user is not acceptable. Killing a process is not acceptable.
  • refresh on boot sounds sensible. Can the user directory be copied pre-emptively and then rsync’d (or similar) at the boot-time refresh to reduce the delay? (rough sketch below)
  • I accept it is a nightmare to ascertain whether the snap is currently running a process or when those processes end. I’m not clever enough to come up with a solution for that beyond a (messy) kill -0 poll of all the possible PIDs.
  • I agree the refreshed snap should be downloaded in the background and cached. The data-copying bit is a real pain in the neck. It makes me sad that it is 2018 and COW-capable filesystems are still not the default on Linux distros.
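For the pre-copy idea I had something like this in mind (hypothetical snap foo going from revision 1 to 2; just a sketch, not something snapd does today):

$ cp -a ~/snap/foo/1 ~/snap/foo/2                   # pre-emptive copy while revision 1 is still in use
$ rsync -a --delete ~/snap/foo/1/ ~/snap/foo/2/     # quick catch-up pass at the boot-time refresh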

I promise to buy you a beer if you get this working :slight_smile:

NMP

1 Like

This is actually trivial. We use the freezer cgroup and put all processes of a given snap into a group named after the snap name.
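For a hypothetical snap named foo, that boils down to something like:

$ cat /sys/fs/cgroup/freezer/snap.foo/cgroup.procs
# one PID per line; empty output means no process of the snap is running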

1 Like

This is a quick dump of my thought process:

Design

Snapd gains the ability to postpone the critical data copy that happens during refresh until all processes associated with the snap being refreshed have terminated. Snaps gain the ability to either allow app startup during a pending refresh or block new apps from being started. Reliable process enumeration is done using the freezer cgroup available in cgroup v1. Reliable cgroup-emptiness detection is done using the release-agent protocol from cgroup v1. We are not considering cgroup v2 yet, since its freezer controller is not yet available, but in v2 the logic has simpler and more reliable equivalents.

EDIT: the point of the .running file is to couple with inotify to get an as-it-happens notification of when a refresh can proceed. Without that we would need to retry often to catch a moment when we can refresh.

Future work:

  • Add a refresh-pending hook; to be useful it should be able to run in the user session and notify user applications (e.g. a Firefox-style update notification).
    This is left out because I believe for desktop applications it would need to be a new type of hook, a user-session hook, which we don’t support at this time.
  • Perform refreshes during early startup, while the machine boots, before services are started, assuming the refreshed snap has been downloaded beforehand.

Changes per component

Changes to snap-confine

  • If /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited is present, wait until the file is removed
    • we may print a message if a PTY is present
  • After populating the freezer cgroup, write /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running
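As a shell sketch of the intended behaviour (the real code lives in snap-confine, which is C; the paths come from the spec above):

inhibit="/run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited"
if [ -e "$inhibit" ] && [ -t 1 ]; then
    echo "refresh of $SNAP_INSTANCE_NAME in progress, waiting..."
fi
while [ -e "$inhibit" ]; do
    sleep 1
done
# ... set up and join the freezer cgroup here ...
touch "/run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running"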

snap-cgroup-v1-release-agent

  • New C binary, optimised for size and speed
  • Follows the cgroup-v1 release agent protocol:
    • takes a single argument: the cgroup path relative to the mount point of the control group hierarchy
    • if the cgroup name matches snap.*, unlinks /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running
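For illustration, the same protocol expressed as a shell script (the real implementation would be the small C binary described above):

#!/bin/sh
# cgroup v1 invokes the release agent with a single argument: the path of the
# now-empty cgroup relative to the hierarchy mount point, e.g. "/snap.foo"
name="${1#/}"
case "$name" in
    snap.*)
        rm -f "/run/snapd/runctl/${name}.running"
        ;;
esac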

Additions to snapd snap manager

  • On startup, the snap manager registers a cgroup v1 release agent on the freezer hierarchy. If this fails, a flag is stored in the manager indicating that the agent is unavailable.
  • Offer a function- and event-based check of whether a given snap is running:
    • use inotify on /run/snapd/runctl if the release agent and inotify are available
    • fall back to polling /run/snapd/runctl if the release agent is available but inotify is not
    • fall back to polling /sys/fs/cgroup/freezer/snap.$SNAP_INSTANCE_NAME/cgroup.procs if the release agent is not available
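Roughly, registering the agent and the event-based check amount to the following (the agent’s installation path is an assumption, and snapd would do this natively rather than shelling out):

# register the release agent for the whole freezer hierarchy
echo /usr/lib/snapd/snap-cgroup-v1-release-agent > /sys/fs/cgroup/freezer/release_agent

# event-based check: watch for the agent removing the .running marker
# (inotifywait is from inotify-tools and stands in for snapd's own inotify watch)
inotifywait -m -e delete /run/snapd/runctl | grep --line-buffered '\.running$'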

Changes to snapd snap manager

  • Refreshes gain a new task, wait-for-apps, that runs before the data copy is done and:
    • if /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running is present, postpones the task
    • if an event-based notification of cgroup vacancy is reported, re-schedules the task immediately
    • ensures that new apps cannot be started by writing refresh-pending to /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited.
      • this may be controlled with refresh-pending-behavior: block | allow syntax, defaulting to allow; this is distinct from the refresh-pending hook, which is left out of the initial implementation.
  • Before restarting services after linking the new revision, unlinks the /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited file
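A shell-level sketch of what the wait-for-apps step amounts to (the real implementation would be Go code inside snapd; only the marker file names come from the spec above):

running="/run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running"
inhibited="/run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited"

echo refresh-pending > "$inhibited"     # optionally block new app startup
while [ -e "$running" ]; do             # postpone until the last process exits
    sleep 1                             # (snapd would react to the cgroup event instead of sleeping)
done
# ... copy data, link the new revision ...
rm -f "$inhibited"                      # before services are restarted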

Behaviour on revert of snapd

When snapd is reverted to a version that doesn’t implement this spec and the system reboots, the ephemeral state in /run is lost, so it plays no further part. The new wait-for-apps tasks will be ignored by the catch-all task handler, returning to the previous refresh logic. We may need some extra handling in case the system does not reboot.

2 Likes

I wonder whether this can cause some problems with older releases of systemd. AFAIK recent versions are using cgroup v2, but old releases (like, really old) used v1 and registered a cgroup release agent.