Process lifecycle on snap refresh

niemeyer · February 5, 2018, 6:00pm

Okay, after many conversations on the topic in various meetings, we’d like to revisit this and push something forward.

Here is a new proposal that takes into account those conversations:

Problem statement

It is reasonable for some workloads to remain running across refreshes. The typical case is managers of containers or virtual machines that want to remain alive while the software that enables them to run gets updated.

Proposal

We’ll introduce a new option that can be used under the application definition scope in snap.yaml and snapcraft.yaml:

apps:
    myapp:
        command: ...
        daemon: ...
        refresh-mode: [ restart / endure ]

The default for this option is restart for daemons and endure for other applications. Initially only changing daemons will be accepted, but we plan to support changing that for other applications too.

As implied, when a daemon is set to endure it won’t be restarted during a refresh. Additionally:

It will lose write access to its per-revision data ($SNAP_DATA)
It won’t lose access to the data common across revisions ($SNAP_COMMON)
The current symlink will be moved to the new revision as usual
Pre and post refresh hooks will work as usual
The security profiles will be updated (see caveats below)

Caveats

In the first implementation of this feature the security profiles for the application will be updated during the refresh, meaning the old process will get the same permissions as the new process. This is not ideal, but it’s a reasonable compromise to get something working sooner, and can be addressed in a future release without breakages.

Comments

How does that sound? Anything we should watch out for or consider further around this feature?

kyrofa · February 5, 2018, 6:05pm

Just to clarify, this particular item will not apply to classically-confined snaps, correct?

jdstrand · February 5, 2018, 6:23pm

I think you mean only write access here. Read access has always been permitted and should remain so across refreshes.

jdstrand · February 5, 2018, 6:24pm

Classic snaps aren’t strictly confined and the thing that makes snaps lose write access to SNAP_DATA is AppArmor, so you are correct, this would not apply to classic snaps.

sherman · February 6, 2018, 7:52am

Does it mean only the “snap refresh” command will work this way?

sherman · February 6, 2018, 7:58am

Also, when the daemon is restarted for an update, is there any way to avoid killing the entire cgroup, as when we set “KillMode=process” in the systemd service config?

niemeyer · February 6, 2018, 1:56pm

Yes, that’s the idea. Is that an issue for a case you’re interested in?

One of the side effects of changing kill mode in systemd is that the behavior of service termination is altered altogether, and it makes no distinction of what the reason for stopping is or what the actual processes involved in the operation are.

In other words, in general when we say processes that should not be terminated, we have a very specific use case in mind, but over the lifetime of a non-trivial application usually multiple processes are spawned with particular purposes. In a well-behaving system, when a snap is removed or disabled altogether, that snap should take out with it any leftover applications that are still in the system, otherwise the administrator loses control of what is running and why.

The case of refreshes is different, though, because the intention wasn’t really to stop or disable anything, but rather to bring an update into it. In those cases it sounds reasonable to keep specific workloads running that are aware of the constraints at play.

Also note that if the proposed refresh-mode: endure option is used, the snap is still free to do whatever it wishes with its own daemon processes. For example, it’d be trivial to call kill -TERM $SOME_PID inside the post-refresh hook, and get the same effect on refreshes as the KillMode=process systemd option.
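To make the hook idea concrete, here is a minimal sketch of such a post-refresh hook. It assumes the daemon records its own PID in a file under $SNAP_COMMON; the file name daemon.pid and the helper signal_daemon are hypothetical and not something snapd provides.

```shell
#!/bin/sh
# Hypothetical post-refresh hook (for example snap/hooks/post-refresh in the
# snapcraft project). Assumption: the daemon itself records its PID in
# $SNAP_COMMON/daemon.pid; snapd does not create such a file.

# Send SIGTERM to the process whose PID is stored in the given file.
# Does nothing if the file is missing or the process is already gone.
signal_daemon() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 0
    pid="$(cat "$pidfile")"
    # kill -0 sends no signal; it only checks that the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        kill -TERM "$pid"
    fi
    return 0
}

# Outside a snap environment SNAP_COMMON is unset and this is a no-op.
signal_daemon "${SNAP_COMMON:-/nonexistent}/daemon.pid"
```

Only the recorded PID is signalled, so any children the daemon spawned are left alone, which is the part that resembles KillMode=process. Whether the daemon is started again afterwards depends on how the snap is set up and is not covered by this sketch.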

@sherman Does that sound reasonable?

sherman · February 7, 2018, 10:52am

we’d like “snap install local_snap.snap” to work this way as well.

sherman · February 7, 2018, 11:07am

I agree that this really applies to our specific use case, but falling back to the default kill mode also “alters the termination altogether”, because it kills everything. IMO snap is making the kill mode more rigid, while systemd is already not fine-grained here. I don’t see how a service’s kill mode could negatively impact snapd, even if a snap’s orphan process is still tracking an older “$SNAP_DATA” or “$SNAP_COMMON”. And if that orphan tracks global data, as under classic confinement, and is guaranteed to exit within a reasonable time, it’s a safe move.

sherman · February 7, 2018, 11:16am

Yes, but the daemon process itself cannot be the “SOME_PID” in “kill -TERM $SOME_PID”, right? As I understand this proposal, the daemon process, which is spawned from the app’s “command”, will never be touched during “snap refresh”, nor can it be killed or restarted from the post-refresh hook, right? Or is there anything I missed here?

niemeyer · February 7, 2018, 12:15pm

That’s reasonable, and we’ll make sure it does respect the restart settings.

When installing a local snap over an existing one, that’s more of a refresh operation than an install one. We should really clarify that in the API.

Per note above, the issue with changing kill mode altogether is that this affects removals as well. Being a package manager, we want to make sure the machine is as clean of side effects as possible when a remove operation takes place.

That said, we very much want to handle your use case too, and these are not in conflict.

Here is a proposal extending the original idea above:

apps:
    myapp:
        command: ...
        daemon: ...
        refresh-mode: [ endure / restart / sigterm[-all] / sighup[-all] / sigusr1[-all] / sigusr2[-all] ]

That means, for example, that if the snap defines refresh-mode: sigterm, then on refreshes and on local installs over an existing snap the main process will receive a SIGTERM signal, which I think is what you want.

Does that address your needs?

sherman · February 8, 2018, 2:57am

yes, exactly, that’ll make the behavior identical to our existing systemd processes

niemeyer · February 8, 2018, 3:00am

Fantastic, thanks for confirming. We’ll implement it and ping you again once we have something ready for testing, if that’s okay. Expect something early next week at the latest.

niemeyer · March 8, 2018, 9:00am

@sherman Sorry for the lack of feedback here. This feature has been in edge for a while, and is ready for testing. The design is per the discussion above, so refresh-mode: sigterm is supposed to do what you want.

Please let us know how it goes.

meloniam · March 20, 2018, 2:07am

Thank you for the changes made. I wanted to confirm whether any change needs to be made in the config for this to work.

niemeyer · March 21, 2018, 12:31pm

Hi Meonia,

The only configuration change necessary to activate the behavior is adding the refresh-mode field under the desired daemon stanza in snapcraft.yaml.

There’s a small problem here which is that snapcraft itself doesn’t yet understand the new option, so it will complain when attempting to package it.

The upcoming release should have it, but meanwhile there is a simple way to test it: after running snapcraft, change the meta/snap.yaml file inside the prime directory that is created. That file looks a lot like snapcraft.yaml since many of the options are just passed through into this file, so it should look familiar.

After adding the new option in the daemon, there are two simple ways to test it:

Create the snap with: snap pack prime . # The . is the target dir
Use snap try prime, and snapd will “install” the prime directory into the system as if it was an actual snap; that’s a very convenient way to experiment with changes before the snap is final, but note that changes to snap.yaml require running the command again

If you run into any issues, we’re happy to assist. We can also help you test that change on our end, if you hand us the snap and describe which daemon should receive the new option and what behavior you expect.

Please let us know how it goes.

mvo · April 17, 2018, 2:19pm

Here is the latest version of the process lifecycle work we have in 2.32.5:

For each snap app that is a daemon, we support:

refresh-mode: {endure,restart} which controls if the app should be restarted at all
stop-mode: {sigterm,sigterm-all,sighup,sighup-all,sigusr1,sigusr1-all,sigusr2,sigusr2-all} which controls how the daemon should be stopped. The given signal is sent to the main PID (when used without -all) or to all PIDs in the process group when the -all suffix is used.

So if you just want to restart your daemon but keep all the children alive you can just use “stop-mode: sigterm”.

hnakamur · September 16, 2019, 12:48pm

Hi, I’d like to send a different signal on refresh than the one set by stop-mode.
This is needed to do a graceful restart when upgrading to a new version of the package.

For example, the nginx rpm sends SIGTERM on a normal stop, but sends SIGUSR2 for a version upgrade.
https://src.fedoraproject.org/rpms/nginx/blob/master/f/nginx-upgrade

So, how about adding values like ‘sigusr2’ to refresh-mode?
The following example means send SIGTERM on normal stop and SIGUSR2 on refresh.

refresh-mode: sigusr2
stop-mode: sigterm

Thanks in advance!

lucyllewy · September 16, 2019, 12:59pm

Signals that do not stop the process are not very useful here: when a snap is updated, the new revision is effectively a separate executable, so you need to stop the old process and start the new executable. Sending a SIGUSR1 signal won’t normally stop a process, so the previous release will still be in use until you reboot or manually restart the service.

hnakamur · September 16, 2019, 2:09pm

For example, nginx has a master process and child processes (worker processes).
Usually the code for the master process is not modified; only the code for the child processes is.

So it is OK for the old master process from the previous release to remain.

The same is true for my graceful restart library https://github.com/hnakamur/serverstarter

To avoid dropping requests during an update, the master process must keep listening and accepting on the socket while upgrading to the new version.

We also need to notify the master process so that it creates new child processes of the new version and then stops the old child processes of the old version.
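
As a rough illustration of the pattern described here (not nginx’s or serverstarter’s actual mechanism), the following sketch shows a long-lived master that, on SIGUSR2, starts a new worker before retiring the old one. WORKER_CMD, the function names, and the single-worker setup are all assumptions made for the example; socket handling is left out entirely.

```shell
#!/bin/sh
# Sketch of a master process that swaps workers on SIGUSR2.
# Assumption: WORKER_CMD stands in for the real worker executable.
WORKER_CMD="${WORKER_CMD:-sleep 300}"
worker_pid=""
generation=0

# Start a worker from whatever WORKER_CMD resolves to at this moment.
spawn_worker() {
    $WORKER_CMD >/dev/null 2>&1 &
    worker_pid=$!
    generation=$((generation + 1))
}

# Start the replacement first, then stop the old worker, so a worker
# is always running while the switch happens.
on_upgrade() {
    old_pid="$worker_pid"
    spawn_worker
    if [ -n "$old_pid" ]; then
        kill -TERM "$old_pid" 2>/dev/null || true
        wait "$old_pid" 2>/dev/null || true
    fi
}

# Main loop of the master; a pending USR2 is handled once the
# current one-second sleep returns.
run_master() {
    trap on_upgrade USR2
    trap 'kill -TERM "$worker_pid" 2>/dev/null; exit 0' TERM INT
    spawn_worker
    while :; do
        sleep 1
    done
}

# Only start the loop when invoked as: script run
if [ "${1:-}" = "run" ]; then
    run_master
fi
```

With something like this as the daemon command, a refresh would only need to deliver SIGUSR2 to the master, which is the behavior being requested for refresh-mode.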