Bug? Saves are blocked to $SNAP_USER_DATA if snap updates when it is already running

zyga-snapd · December 11, 2018, 4:25pm

I don’t know of a single distribution that ships v2. We don’t support v2. v2 doesn’t support freezer controller yet. In v2 we don’t need the release agent at all since any cgroup hierarchy has an eventfd(2) based notification mechanism that is far more superior.

mborzecki · December 11, 2018, 4:30pm

Fedora:

[guest@localhost snapd]$ mount|grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)

Arch:

maciek@galeon:~/work/canonical mount|grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

AFAICT systemd is only managing stuff under cgroup2 mount these days.

zyga-snapd · December 11, 2018, 4:32pm

You have the unified hierarchy but we cannot care less about it. The v1 hierarchy still exists on Fedora 29. The day v2 becomes shippable as a replacement for v1 we can start to look at supporting it.

jdstrand · December 11, 2018, 4:33pm

I think we will want to unconditionally write to syslog though. Since we are waiting and not logging/exiting, this should work fine (in addition to stdout/stderr on a PTY).

Should this be /run/snapd/run*ctl*/snap.$SNAP_INSTANCE_NAME.inhibited? Perhaps not since you are using an inotify on the dir, but it feels slightly odd to have running and runctl.

Also, I wonder if people will abuse this as a crude form of refresh control…

Can you update to say that ‘refresh-pending’ here is the new hook?

What do you mean by “linking the new revision” in this context? The data copy? Moving the “current” symlink (which implies the data copy happened)? After answering, can you update the spec to be more specific on this point?

This isn’t clear-- are you saying 'Behavior on revert of snapd to a version that doesn’t support postponing, or something else? Probably would want to think through scenarios of a snapd revert to a version that still supports postponing since we may want the refresh to happen unconditionally then. Eg, browser snap is running in the user’s session for weeks, etc and the refresh is delayed the whole time, then then user reboots. In this scenario, we want to be careful about the refresh being interrupted during the shutdown since the user may have stopped the browser and rebooted (perhaps snapd tells systemd to inhibit shutdown until done?). Also, if for some reason we are in the startup after a shutdown, if there are any pending refreshes (can this happen with the above inihibit logic?), process them so the user can’t login and start the browser before snapd has a chance to refresh.

One thing that is missing as a statement as to how this interacts with daemon reloads. Ie, right now this is no problem with snapd refreshing a snap with daemons, since those daemons are automatically stopped and restarted. Where does this stop/start logic fit in with the above? How does this interplay with user session autostart (maybe not at all; but please call it out. It is possible for something to be autostarted that doesn’t show any graphical elements (so the user may not be aware of them anymore), so these might only ever be refreshed on reboot) as well as the upcoming dbus session services work from @jamesh?

Lastly, there is no discussion of reminding the user that an update is pending. I know that there was feedback that this is not desired, but as a countpoint, I think a user does want to know since people don’t want to be running an insecure browser, for example. The trick is being tasteful. Whether we force the issue and kill the processes after some threshold I think needs a conversation, but that discussion can happen later. I do think that reminders should be in earlier iterations on the feature though.

zyga-snapd · December 11, 2018, 4:35pm

Yes, I meant runctl. I think it’s file to have both of those files in one directory since the volume of traffic is extremely low.

I accidentally used the same name for a hypothetical new hook and snap refresh property. To be clear: the hook would notify an snap about a pending refresh (perhaps offering the snap to kill itself via snapctl), the property would control the behaviour of starting new application processes during a refresh. Those are distinct and have different purpose.

I mean that after the new revision has launchers and the current symlink on disk. I will update the spec to be more accurate.

Yes, that’s what I meant.

I don’t see why we’d want to make the refresh happen unconditionally. Do note that snapd is exempt from this process since it doesn’t run as a snap application.

For a case like a desktop app running for long amount of time I do believe we need refresh-pending hook being invoked in the context of user session to notify the application.

The idea with refreshing during reboot was similar to how some distributions use systemd for over-reboot refreshes in single user mode. We could mimic some of that.

In the initial version of this work I’m not specifying that accurately since I don’t believe we will implement it straight away.

If you mean systemd reloading itself, I don’t believe it does. If you mean snapd reloading itself, I don’t believe it does either. If you mean snapd restarting services across a refresh it doesn’t have any special handling because I don’t believe it needs any.

As outlined above: snapd keeps stopping services before refresh, keeps starting them after refresh. The new element is that the actual refresh (and associated data copy) will only complete once all processes are terminated. This specifically includes processes that not necessarily have terminated since they were not services.

I believe we might need more knobs to control this for special cases like LXD running long-running containers. Those may need some exceptional settings. What I’m describing is aimed at improving the desktop application data consistency across refreshes. It will also work well with snaps that don’t do anything fancy with their refresh mode and just stop and start services as usual.

The only interplay is that I wish we had support for user-session hooks so that we could notify applications that have such capability to recommend to their users to perform a restart to complete the upgrade.

I believe this is answered above with the refresh-pending hook. This is specifically not included in the current plan since we don’t have support for user session hooks yet.

jdstrand · December 11, 2018, 7:27pm

I was talking about regular refreshes for snaps with daemons. You say here “as outlined above” but I didn’t see it the first time and I still don’t see it (am I blind!?!). My point was that if you don’t think anything needs to be done, just have a single statement in the spec that says as much with (brief) justification as to why. It seems plausibly ok this is fine, but seeing that snap-confine is looking at the ‘inhibited’ file and waiting and snap services are launched via their systemd unit that ultimately ends up calling snap-confine (which would wait if the file was still there), I think it should be called out explicitly that it’s ok.

jdstrand · December 11, 2018, 7:31pm

Well… perform a restart of what? The snap (again, the snap may have a session service that isn’t easily restartable)? The user’s session (which userd could do)? I really think that user services need to be thought through since those processes won’t leave until session restart (or some other new mechanism snapd could implement) and so the refresh will be inhibited, the user might get reminded to stop the snap, won’t be able to, and then the cycle repeats until the user just happens to stop the session.

jdstrand · December 11, 2018, 7:33pm

That is for the snap to opt in. There is also the case of all kinds of desktop applications not refreshing because they don’t implement the hook. userd can provide a notification to restart the session or snaps (see last response though). There are UX considerations here.

jdstrand · December 11, 2018, 7:41pm

zyga-snapd:

Eg, browser snap is running in the user’s session for weeks, etc and the refresh is delayed the whole time, then then user reboots. In this scenario, we want to be careful about the refresh being interrupted during the shutdown since the user may have stopped the browser and rebooted (perhaps snapd tells systemd to inhibit shutdown until done?).

For a case like a desktop app running for long amount of time I do believe we need refresh-pending hook being invoked in the context of user session to notify the application.

The idea with refreshing during reboot was similar to how some distributions use systemd for over-reboot refreshes in single user mode. We could mimic some of that.

In the initial version of this work I’m not specifying that accurately since I don’t believe we will implement it straight away.

I really think this needs to be thought through and implemented in the first iteration. I outlined a problem with the current spec: the browser is running for weeks and thus the refresh is inhibited. The user turns off the computer. During this process, snapd notices that no processes are running for this snap and starts to refresh. The system kills snapd and shuts off before that is complete. You need to at the very least inhibit the shutdown via system while the refreshes are happening and then uninhibit when done. However, there are UX concerns cause the user might hold down the power button if the refreshes during the inhibited shutdown are taking too long, destroying the data in the new revision. Possibly better is if on shutdown, userd alerts and/or prompts and lets the user know/decide snapd is refreshing snaps. (Point is, I don’t think this can be more than an experimental feature without thinking through all these interactions).

zyga-snapd · December 11, 2018, 8:00pm

I agree there are user considerations here but this feels like something on the road ahead and not in the initial implementation. Note that the feature won’t be enabled by default yet. Like everything else we do it will be hidden behind a feature flag until we are happy with how it behaves.

mcphail · December 11, 2018, 8:21pm

My own view is that a refresh should be aborted if the system shuts down. If I’m stopping work on a machine it usually means I need to be somewhere else. Delaying shutdown for a system update I didn’t initiate would be a worrying emulation of a different operating system.

A refresh at reboot would be acceptable. Perhaps the logic could be:

@reboot, snaps marked for refresh have their apps replaced by “wrapper” scripts
- if the user launches an app from an interactive environment, the wrapper script gives the option of running the previous version of the app whilst the refresh runs in the background, but warns that work will not be saved
- if the app is launched by a non-interactive environment such as an autostart, the wrapper script delays launch until the refresh completes
when the refresh is complete, the wrapper scripts are removed and the new version can be launched normally

Possible? Sensible?

jamesh · December 12, 2018, 10:16am

With my user-daemons PR, user services are started via snap run. So they pose the same problem as a long running snap application.

Their state is visible to and can be controlled by the user mode systemd instance though. So it would be possible for some agent running in the user session to trigger restarts on behalf of the system.

pedronis · December 19, 2018, 8:53am

notice that at the point when we copy data by current flow we have removed the current link so conceptually new apps cannot be started, I think to have a consistent story we need to wait earlier actually, before stopping the services and unlinking the current revision.

Regarding the hook, I fear postponing it might not be viable as @jdstrand mentioned. Also do we need to consider also the case of service only snaps? Do we need a global hook as well? Notice also that given the shared nature of snap installations, in theory the snap could be being run from different users at the same time, in some scenarios.

As a minor note refresh-pending as a hook name doesn’t fit our current patterns.

zyga-snapd · February 11, 2019, 10:30am

We discussed this at length a few weeks ago and today I’m commencing implementation.

I will now quote the current design plan:

Simple approach first, covers the most common desktop case:

When user requests a refresh interactively we perform synchronous server side soft verification that the request can go forward, for details about the verification, see below. If verification fails we return a synchronous error response, otherwise we create the usual change and return the async response In the refresh chain, before stopping services we perform another soft verification. If the verification fails but is below the grace period of postponed refreshes we fail the change (due to lanes only the affected snaps will fail to refresh). If the last refresh time is no longer in the grace period we remember to kill all processes and carry on. After stopping services we perform hard verification, if that fails but we are still within the grace period we restart services we’ve stopped and fail the change, as above. If the grace period has elapsed we kill all processes belonging to the snap and proceed with the refresh as usual.

Soft verification - the set of processes belonging to non-service applications is non-empty Hard verification - the set of processes belonging to a given snap is non-empty

We can compute those sets by examining our freezer cgroup process list (/sys/fs/cgroup/freezer/snap.$SNAP_NAME/cgroup.procs) and set of processes belonging to all the services that exist in the snap (by looking at /sys/fs/cgroup/systemd/system.slice/snap.$SNAP_NAME.*.service/cgroup.procs)

Once simple approach is implemented, we can consider several improvements.

We can initiate the refresh process instantly after the last application process terminates using cgroup v1 or v2 notification mechanism.

We can introduce new hooks that notify an application about a pending update

We can introduce session-level hooks via snapd and snap-userd to deliver messages to the session of users that have logged in

We can pre-download the snap and perform the update in a special boot mode, matching similar work on recent desktop and server systems.

A mechanism that allows applications to grab refresh inhibit locks for critical operations for a bound amount of time (independent of the logic above)

This is also represented as snapd bug report Bug #1616650 “snap refresh while command is running may cause is...” : Bugs : snapd

zyga-snapd · February 13, 2019, 11:01am

While working on this I came across a part that should be explicitly designed as it involves user interaction. The specific decision on what to do when in a set of snaps that is being refreshed, some snaps are busy, some snaps were explicitly requested, what should happen: should we refresh partially or reject, and if we refreshed partially how to notify the user.

I think it’s easier to consider several cases as hypothetical user interaction on command line.

Context for examples below

snaps s1, s2, s3 and s4 are installed.
snaps s1 and s2 have pending refreshes.
snaps s2 and s3 have running non-service apps and cannot be refreshed.

Refreshing all snaps implicitly

$ snap refresh
refreshing snap s1
cannot refresh snap s2, it is currently running.

The error message about s2 is task log message, not a warning, because the user will see it and can take action immediately.

Auto-refreshing all snaps

This is exactly the same as above, namely, what can refresh does so, with the exception that the error message becomes a warning that the user can see via the warnings system.

Refreshing a single snap explicitly

$ snap refresh s1

Here snap s1 is outdated and not busy so it can refresh as usual.

$ snap refresh s2
cannot refresh snap s2, it is currently running

Here snap s2 is outdated but busy so it cannot refresh.

The error message about s2 should be a synchronous error from the API call per discussion with @niemayer in Cape Town.

Currently this cannot be done this way (synchronously) because we unconditionally refresh assertions before passing control to Update/UpdateMany. Instead error message can be logged via the task logger as for the case “refreshing all snaps implicitly” above.

$ snap refresh s3
snap s3 is up-to-date

Here snap s3 is up-to-date so we don’t even consider checking if it is busy or not.

Refreshing multiple snaps explicitly

When multiple snaps are refreshed we can complete a subset of the operation, depending on what the state of each requested snap is.

$ snap refresh s1 s2
refreshing snap s1
cannot refresh snap s2, it is currently running.

Here s1 and s2 are outdated but s2 is busy.

The error message about s2 is an asynchronous task log message, not a warning, because the user will see it and can take action immediately. At the same time, because s1 can refresh so the whole operation is a success.

$ snap refresh s2 s3
cannot refresh snap s2, its is currently running.
snap s3 is up-to-date

Here s2 is outdated and busy and s3 is up-to-date.

Here nothing was actually refreshed. The case behaves just as the one above though: the error message is logged asynchronously because we were busy checking for updates for s2 before we determined that it is busy. There are no warnings used since the user can see the message and act upon it.

$ snap refresh s3 s4
snaps are up-to-date, nothing to do

Here both s3 and s4 are up-to-date. We have not even considered if they are busy or not.

chipaca · February 13, 2019, 11:32am

can we tell the user what is actually running? (snaps don’t run)

this does not follow. The assertions refresh is still synchronous, as is the initial part of Update which includes checking the store for updates. The asynchronous portion is started inside snapstate.doUpdate, which is where I’d recommend you add the check.

zyga-snapd · February 13, 2019, 11:34am

Yes, we can tell precisely which apps or hooks are active. We expect that services will go down so we can ignore those in the output. This was not designed for UX so I left this out but we have the technical capacity.

I used “synchronous” to somehow mean “before any network IO which is the slow part”. I really meant the latter.

chipaca · February 13, 2019, 11:37am

that doesn’t make sense. We want to check the store for updates before telling the user to close their app for the refresh. That check is over the network, and is done in the synchronous portion of the request, before initiating tasks. Only after that, when we know we’re updating the snap, should we be checking whether it has things running.

zyga-snapd · February 13, 2019, 11:39am

In some sense it depends on how you word the message. “Cannot refresh foo while it is running” in response to a direct request to refresh foo feels like a sensible thing, even if foo has no updates yet.

Another aspect is how this would look like from something like GNOME Software, where going to the apps page would show you what you can refresh (without doing so yet) and giving you a way to refresh each item individually.

chipaca · February 13, 2019, 11:40am

given we ship GNOME Software as a snap, that’ll be fun.