Bug? Saves are blocked to $SNAP_USER_DATA if snap updates when it is already running

zyga-snapd · April 3, 2018, 7:01pm

I didn’t look at my notes or at the discussions we had before so I may have missed that. I will definitely review everything before working on this.

jdstrand · April 4, 2018, 1:45pm

Can the full plan be put into this topic? I suggested something else to @chipaca due to complications with environ not being able to be updated as well as issues with open file descriptors.

chipaca · April 4, 2018, 1:50pm

After checking with @zyga-snapd, I think indeed Gustavo’s forgetting the hangout we had going over our conversation with you, jj, and zyga, in which the task was moved from my plate to zyga’s because of it.

But zyga’ll update with more detailed notes I’m sure.

kyrofa · May 7, 2018, 8:26pm

Hey @zyga-snapd, any update?

zyga-snapd · May 8, 2018, 8:19am

I haven’t started working on this yet. I will summarize my thoughts and put them into writing here soon though as I could start working on this in parallel to other tasks.

zyga-snapd · December 10, 2018, 6:14pm

goal: refresh when no process running
is refresh-mode or other things affecting this?
if process runs for too long, what then?
- kill it?
- refresh the old way?
- run a hook and ask it to quit?
we can refresh on early boot if the snap is cached
technical aspect: refreshing when freezer cgroup is empty
- cgroup v2 has active notification feature, we cannot use it
- cgroup v1 requires filesystem monitoring or polling
- notification enabled/disabled only ahead of refresh inside snapd, should not be costly in practice
technical aspect: unclear how to behave when refresh is forbidden
- refresh forbidden when new revision is not ready
- download not a problem, can happen before
- seems that main problem is copying data, can take arbitrarily long
- some parallels with system-key behaviour
- we can just block and wait until ready
- we need a new mechanism, locks timeout after 3 seconds
snapd process needs to monitor cgroup vacancy
- must acquire lock, recheck, write the refresh-in-progress marker
- refresh-in-progress marker removed when new snap revision is active
idea: parts of refresh check if it is safe to do so, retry if not
idea: if event comes in that we can refresh now, we trigger the refresh

mcphail · December 10, 2018, 8:20pm

Thanks for finding time for this, @zyga-snapd.

goal: agreed
I don’t think the system should decide how long is “too long” for a process to run. Any tool should be able to be packaged as a snap, and snapd should not interfere with the running of those tools. Nagging a user is not acceptable. Killing a process is not acceptable.
refresh on boot sounds sensible. Can the user directory be copied pre-emptively and then rsync’d (or similar) at the boot refresh to reduce the delay?
I accept it is a nightmare to ascertain whether the snap is currently running a process or when those processes end. I’m not clever enough to come up with a solution for that beyond a (messy) kill -0 poll of all the possible PIDs
I agree the refreshed snap should be downloaded in the background and cached. The copying data bit is a real pain in the neck. It makes me sad that it is 2018 and COW-capable filesystems are still not default on linux distros.

I promise to buy you a beer if you get this working

NMP

zyga-snapd · December 11, 2018, 9:03am

This is actually trivial. We use the freezer cgroup and put all processes of a given snap into a group named after the snap name.

zyga-snapd · December 11, 2018, 12:32pm

This is a quick dump of my thought process:

Design

Snapd gains the ability to postpone the critical data copy that happens during refresh, until all processes associated with the snap being refreshed are terminated. Snaps gain ability to either allow app startup during pending refresh or block new apps from being started. Reliable process enumeration is done using the freezer cgroup available in cgroup v1. Reliable cgroup emptiness is done using release-agent protocol from cgroup v1. We are not considering cgroup v2 yet, since the freezer controller is unavailable yet but in v2 the logic has simpler and more reliable equivalents.

EDIT: the point of the .running file is to couple with inotify to get as-it-happens notification when a refresh should happen. Without that we would need to retry often to catch a moment where we can refresh.

Future work:

Add a refresh-pending hook, to be useful it should be able to run in user session and notify user applications (e.g. Firefox update notification).
This is left out because I believe for desktop applications it would need to be a new type of hook, a user session hook, which we don’t support at this time.
Perform refreshes during early startup, while machine boots, before services are started, assuming the refreshed snap is downloaded before.

Changes per component

Changes to snap-confine

If /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited is present then wait until the file is removed
- we may print a message if PTY is present
After populating the freezer cgroup write /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running

snap-cgroup-v1-release-agent

New C binary, optimised for size and speed
Follows the cgroup-v1 release agent protocol:
- takes single argument, cgroup path relative to mount point of control group
- if cgroup name matches snap.* then unlinks /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running

Additions to snapd snap manager

On startup, the snap manager registers a cgroup v1 release agent on the freezer cgroup. If this fails a flag is stored in the manager that agent is unavailable.
Offer function and event-based check if a given snap is running:
- use inotify on /run/snapd/runctl if release-agent and inotify are available
- fall-back to polling on /run/snapd/runctl if release-agent is available but inotify is not
- fall back to polling on /sys/fs/cgroup/freezer/snap.$SNAP_INSTANCE_NAME/cgroup.procs if release-agent is not available

Changes to snapd snap manager

Refresh tasks gain new task wait-for-apps that runs before data copy is done and:
- if /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.running is present, postpones the task
- if event based notification of cgroup vacancy is reported, re-schedules the task immediately
- ensures that new apps cannot be started by writing refresh-pending to /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited.
  - this may be controlled with refresh-pending-behavior: block | allow syntax, defaulting to allow, this is distinct from the refresh-pending hook that is postponed in the initial implementation.
Before restarting services after linking the new revision, unlinks the /run/snapd/runctl/snap.$SNAP_INSTANCE_NAME.inhibited file

Behaviour on revert of snapd

When snapd is reverted to a version that doesn’t implement this spec and the system reboots, the ephemeral state in /run is lost so it plays no further part. The new wait-for-apps tasks will be ignored by the catch-all task handler, returning to previous refresh logic. We may need some help in case system does not reboot.

mborzecki · December 11, 2018, 4:24pm

Wonder whether this can cause some problems with older releases of systemd. AFAIK recent versions are using v2, but the old releases (like really old) used v1 and registered an cgroup release agent.

zyga-snapd · December 11, 2018, 4:25pm

I don’t know of a single distribution that ships v2. We don’t support v2. v2 doesn’t support freezer controller yet. In v2 we don’t need the release agent at all since any cgroup hierarchy has an eventfd(2) based notification mechanism that is far more superior.

mborzecki · December 11, 2018, 4:30pm

Fedora:

[guest@localhost snapd]$ mount|grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)

Arch:

maciek@galeon:~/work/canonical mount|grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

AFAICT systemd is only managing stuff under cgroup2 mount these days.

zyga-snapd · December 11, 2018, 4:32pm

You have the unified hierarchy but we cannot care less about it. The v1 hierarchy still exists on Fedora 29. The day v2 becomes shippable as a replacement for v1 we can start to look at supporting it.

jdstrand · December 11, 2018, 4:33pm

I think we will want to unconditionally write to syslog though. Since we are waiting and not logging/exiting, this should work fine (in addition to stdout/stderr on a PTY).

Should this be /run/snapd/run*ctl*/snap.$SNAP_INSTANCE_NAME.inhibited? Perhaps not since you are using an inotify on the dir, but it feels slightly odd to have running and runctl.

Also, I wonder if people will abuse this as a crude form of refresh control…

Can you update to say that ‘refresh-pending’ here is the new hook?

What do you mean by “linking the new revision” in this context? The data copy? Moving the “current” symlink (which implies the data copy happened)? After answering, can you update the spec to be more specific on this point?

This isn’t clear-- are you saying 'Behavior on revert of snapd to a version that doesn’t support postponing, or something else? Probably would want to think through scenarios of a snapd revert to a version that still supports postponing since we may want the refresh to happen unconditionally then. Eg, browser snap is running in the user’s session for weeks, etc and the refresh is delayed the whole time, then then user reboots. In this scenario, we want to be careful about the refresh being interrupted during the shutdown since the user may have stopped the browser and rebooted (perhaps snapd tells systemd to inhibit shutdown until done?). Also, if for some reason we are in the startup after a shutdown, if there are any pending refreshes (can this happen with the above inihibit logic?), process them so the user can’t login and start the browser before snapd has a chance to refresh.

One thing that is missing as a statement as to how this interacts with daemon reloads. Ie, right now this is no problem with snapd refreshing a snap with daemons, since those daemons are automatically stopped and restarted. Where does this stop/start logic fit in with the above? How does this interplay with user session autostart (maybe not at all; but please call it out. It is possible for something to be autostarted that doesn’t show any graphical elements (so the user may not be aware of them anymore), so these might only ever be refreshed on reboot) as well as the upcoming dbus session services work from @jamesh?

Lastly, there is no discussion of reminding the user that an update is pending. I know that there was feedback that this is not desired, but as a countpoint, I think a user does want to know since people don’t want to be running an insecure browser, for example. The trick is being tasteful. Whether we force the issue and kill the processes after some threshold I think needs a conversation, but that discussion can happen later. I do think that reminders should be in earlier iterations on the feature though.

zyga-snapd · December 11, 2018, 4:35pm

Yes, I meant runctl. I think it’s file to have both of those files in one directory since the volume of traffic is extremely low.

I accidentally used the same name for a hypothetical new hook and snap refresh property. To be clear: the hook would notify an snap about a pending refresh (perhaps offering the snap to kill itself via snapctl), the property would control the behaviour of starting new application processes during a refresh. Those are distinct and have different purpose.

I mean that after the new revision has launchers and the current symlink on disk. I will update the spec to be more accurate.

Yes, that’s what I meant.

I don’t see why we’d want to make the refresh happen unconditionally. Do note that snapd is exempt from this process since it doesn’t run as a snap application.

For a case like a desktop app running for long amount of time I do believe we need refresh-pending hook being invoked in the context of user session to notify the application.

The idea with refreshing during reboot was similar to how some distributions use systemd for over-reboot refreshes in single user mode. We could mimic some of that.

In the initial version of this work I’m not specifying that accurately since I don’t believe we will implement it straight away.

If you mean systemd reloading itself, I don’t believe it does. If you mean snapd reloading itself, I don’t believe it does either. If you mean snapd restarting services across a refresh it doesn’t have any special handling because I don’t believe it needs any.

As outlined above: snapd keeps stopping services before refresh, keeps starting them after refresh. The new element is that the actual refresh (and associated data copy) will only complete once all processes are terminated. This specifically includes processes that not necessarily have terminated since they were not services.

I believe we might need more knobs to control this for special cases like LXD running long-running containers. Those may need some exceptional settings. What I’m describing is aimed at improving the desktop application data consistency across refreshes. It will also work well with snaps that don’t do anything fancy with their refresh mode and just stop and start services as usual.

The only interplay is that I wish we had support for user-session hooks so that we could notify applications that have such capability to recommend to their users to perform a restart to complete the upgrade.

I believe this is answered above with the refresh-pending hook. This is specifically not included in the current plan since we don’t have support for user session hooks yet.

jdstrand · December 11, 2018, 7:27pm

I was talking about regular refreshes for snaps with daemons. You say here “as outlined above” but I didn’t see it the first time and I still don’t see it (am I blind!?!). My point was that if you don’t think anything needs to be done, just have a single statement in the spec that says as much with (brief) justification as to why. It seems plausibly ok this is fine, but seeing that snap-confine is looking at the ‘inhibited’ file and waiting and snap services are launched via their systemd unit that ultimately ends up calling snap-confine (which would wait if the file was still there), I think it should be called out explicitly that it’s ok.

jdstrand · December 11, 2018, 7:31pm

Well… perform a restart of what? The snap (again, the snap may have a session service that isn’t easily restartable)? The user’s session (which userd could do)? I really think that user services need to be thought through since those processes won’t leave until session restart (or some other new mechanism snapd could implement) and so the refresh will be inhibited, the user might get reminded to stop the snap, won’t be able to, and then the cycle repeats until the user just happens to stop the session.

jdstrand · December 11, 2018, 7:33pm

That is for the snap to opt in. There is also the case of all kinds of desktop applications not refreshing because they don’t implement the hook. userd can provide a notification to restart the session or snaps (see last response though). There are UX considerations here.

jdstrand · December 11, 2018, 7:41pm

zyga-snapd:

Eg, browser snap is running in the user’s session for weeks, etc and the refresh is delayed the whole time, then then user reboots. In this scenario, we want to be careful about the refresh being interrupted during the shutdown since the user may have stopped the browser and rebooted (perhaps snapd tells systemd to inhibit shutdown until done?).

For a case like a desktop app running for long amount of time I do believe we need refresh-pending hook being invoked in the context of user session to notify the application.

The idea with refreshing during reboot was similar to how some distributions use systemd for over-reboot refreshes in single user mode. We could mimic some of that.

In the initial version of this work I’m not specifying that accurately since I don’t believe we will implement it straight away.

I really think this needs to be thought through and implemented in the first iteration. I outlined a problem with the current spec: the browser is running for weeks and thus the refresh is inhibited. The user turns off the computer. During this process, snapd notices that no processes are running for this snap and starts to refresh. The system kills snapd and shuts off before that is complete. You need to at the very least inhibit the shutdown via system while the refreshes are happening and then uninhibit when done. However, there are UX concerns cause the user might hold down the power button if the refreshes during the inhibited shutdown are taking too long, destroying the data in the new revision. Possibly better is if on shutdown, userd alerts and/or prompts and lets the user know/decide snapd is refreshing snaps. (Point is, I don’t think this can be more than an experimental feature without thinking through all these interactions).

zyga-snapd · December 11, 2018, 8:00pm

I agree there are user considerations here but this feels like something on the road ahead and not in the initial implementation. Note that the feature won’t be enabled by default yet. Like everything else we do it will be hidden behind a feature flag until we are happy with how it behaves.