Process lifecycle on snap refresh


#18

I think we’re conflating two different things here. Restarting a daemon without killing everything is of course fine and a good idea. The refresh case is different: it comes with a number of associated displacements, so it is problematic not to have a complete halt while the refresh is taking place, as far as the processes depending on the old state are concerned.

In fact, the Docker documentation agrees with me, right there on that page:

If you skip releases during an upgrade, the daemon may not restore its connection to the containers. If the daemon is unable to restore the connection, it ignores the running containers and you must manage them manually.


#19

If you kill -9 PostgreSQL, rather than wait for it to shut down gracefully, the database will go into recovery mode, which could take it out of operation for a long period. PostgreSQL is well-written software, and as such takes great care to shut down gracefully and minimize downtime. On larger deployments this can take some time.

Can SIGKILL behaviour be configurable? Terminating PostgreSQL with SIGKILL is against best practice and unacceptable in most production deploys, where it is better to leave the system running until the problem can be manually dealt with. The recommended systemd service files for PostgreSQL disable using SIGKILL for this reason.

This sort of policy is now generally encoded in systemd service files, so maybe snapd should expose more systemd configuration options and rely on systemd to restart… eventually, maybe even after the system is rebooted. As a user and sysadmin, I would rather have snapd blocked than have it overrule my policies and terminate my user-facing services.


#20

but under classic confinement this is not a problem right?


#21

Okay, after many conversations on the topic in various meetings, we’d like to revisit this and push something forward.

Here is a new proposal that takes into account those conversations:

Problem statement

It is reasonable for some workloads to remain running across refreshes. The typical case is managers of containers or virtual machines that want to remain alive while the software that enables them to run gets updated.

Proposal

We’ll introduce a new option that can be used under the application definition scope in snap.yaml and snapcraft.yaml:

apps:
    myapp:
        command: ...
        daemon: ...
        refresh-mode: [ restart / endure ] 

The default for this option is restart for daemons and endure for other applications. Initially only changing daemons will be accepted, but we plan to support changing that for other applications too.

As implied, when a daemon is set to endure it won’t be restarted during a refresh. Additionally:

  • It will lose write access to its per-revision data ($SNAP_DATA)
  • It won’t lose access to the data common across revisions ($SNAP_COMMON)
  • The current symlink will be moved to the new revision as usual
  • Pre and post refresh hooks will work as usual
  • The security profiles will be updated (see caveats below)
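
As a rough illustration of the proposal, a daemon that should survive refreshes might be declared as in the sketch below. The app name and command path are hypothetical placeholders, and the syntax follows the proposal in this post rather than any final implementation.

```yaml
# Sketch of a snap.yaml / snapcraft.yaml fragment, assuming the proposed option.
# "container-manager" and its command path are hypothetical names.
apps:
    container-manager:
        command: bin/container-manager
        daemon: simple
        # Proposed option: do not restart this daemon when the snap is refreshed.
        refresh-mode: endure
```

With this setting, the running process would keep access to $SNAP_COMMON but lose write access to $SNAP_DATA after the refresh, per the list above.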

Caveats

In the first implementation of this feature the security profiles for the application will be updated during the refresh, meaning the old process will get the same permissions as the new process. This is not ideal, but it’s a reasonable compromise to get something working sooner, and can be addressed in a future release without breakages.

Comments

How does that sound? Anything we should watch out for or consider further around this feature?


#22

Just to clarify, this particular item will not apply to classically-confined snaps, correct?


#23

I think you mean only write access here. read has always been permitted and should be across refresh.


#24

Classic snaps aren’t strictly confined, and the thing that makes snaps lose write access to SNAP_DATA is AppArmor, so you are correct: this would not apply to classic snaps.


#25

does it mean only “snap refresh” command will work this way?


#26

also, in the case of restarting the daemon for an update, is there any way to avoid killing the entire cgroup, just like when we set “KillMode=process” in the systemd service config?


#27

Yes, that’s the idea. Is that an issue for a case you’re interested in?

One of the side effects of changing kill mode in systemd is that the behavior of service termination is altered altogether, and it makes no distinction of what the reason for stopping is or what the actual processes involved in the operation are.

In other words, when we say processes should not be terminated, we generally have a very specific use case in mind, but over the lifetime of a non-trivial application multiple processes are usually spawned for particular purposes. In a well-behaved system, when a snap is removed or disabled altogether, that snap should take out with it any leftover applications that are still in the system; otherwise the administrator loses control of what is running and why.

The case of refreshes is different, though, because the intention wasn’t really to stop or disable anything, but rather to bring an update into it. In those cases it sounds reasonable to keep specific workloads running that are aware of the constraints at play.

Also note that if the proposed refresh-mode: endure option is used, the snap is still free to do whatever it wishes with its own daemon processes. For example, it’d be trivial to call kill -TERM $SOME_PID inside the post-refresh hook, and get the same effect on refreshes as the KillMode=process systemd option.

@sherman Does that sound reasonable?


#28

we’d like “snap install local_snap.snap” to work this way as well.


#29

I agree that this really applies to our specific use case, but falling back to the default kill mode also “alters the termination altogether”, because it kills everything. IMO snap is making the kill mode stricter, while systemd is already not fine-grained on this. I don’t see how the service kill mode can negatively impact snapd, even if a snap’s orphan process is tracking an older “$SNAP_DATA” or “$SNAP_COMMON”. And if that orphan is tracking global data, as under classic confinement, and is guaranteed to exit within a reasonable time, it’s a safe move.


#30

yes, but the daemon “process” itself cannot be that “SOME_PID” in “kill -TERM $SOME_PID”, right? As I understand this proposal, the daemon process, which is spawned from the app’s “command”, will never be touched throughout “snap refresh”, nor can it be killed/restarted from the post-refresh hook, right? Or is there anything I missed here?


#31

That’s reasonable, and we’ll make sure it does respect the restart settings.

When installing a local snap over an existing one, that’s more of a refresh operation than an install one. We should really clarify that in the API.

Per note above, the issue with changing kill mode altogether is that this affects removals as well. Being a package manager, we want to make sure the machine is as clean of side effects as possible when a remove operation takes place.

That said, we very much want to handle your use case too, and these are not in conflict.

Here is a proposal extending the original idea above:

apps:
    myapp:
        command: ...
        daemon: ...
        refresh-mode: [ endure / restart / sigterm[-all] / sighup[-all] / sigusr1[-all] / sigusr2[-all] ] 

That means, for example, that if the snap defines refresh-mode: sigterm, then on refreshes and on local installs over an existing snap the main process will receive a SIGTERM signal, which I think is what you want.

Does that address your needs?


#32

yes, exactly, that’ll make the behavior identical to our existing systemd processes


#33

Fantastic, thanks for confirming. We’ll implement it and ping you again once we have something ready for testing, if that’s okay. Expect something early next week at the latest.


#34

@sherman Sorry for the lack of feedback here. This feature has been in edge for a while, and ready for testing. The design is per the discussion above, so refresh-mode: sigterm is supposed to do what you want.

Please let us know how it goes.


#35

Thank you for the changes made. I wanted to confirm whether any change needs to be made in the config for this to work?


#36

Hi Meonia,

The only configuration change necessary to activate the behavior is adding the refresh-mode field under the desired daemon stanza in snapcraft.yaml.

There’s a small problem here which is that snapcraft itself doesn’t yet understand the new option, so it will complain when attempting to package it.

The upcoming release should have it, but meanwhile there is a simple way to test it: after running snapcraft, edit the meta/snap.yaml file inside the prime directory that is created. That file looks a lot like snapcraft.yaml, since many of the options are just passed through into this file, so it should look familiar.

After adding the new option in the daemon, there are two simple ways to test it:

  1. Create the snap with: snap pack prime . # The . is the target dir
  2. Use snap try prime, and snapd will “install” the prime directory into the system as if it was an actual snap; that’s a very convenient way to experiment with changes before the snap is final, but note that changes to snap.yaml require running the command again

If you run into any issues, we’re happy to assist. We can also help you test that change on our end, if you hand us the snap and describe which daemon should receive the new option and what behavior you expect.

Please let us know how it goes.


#37

Here is the latest version of the process lifecycle work we have in 2.32.5:

For each snap app that is a daemon, we support:

  • refresh-mode: {endure,restart} which controls if the app should be restarted at all
  • stop-mode: {sigterm,sigterm-all,sighup,sighup-all,sigusr1,sigusr1-all,sigusr2,sigusr2-all} which controls how the daemon should be stopped. The given signal is sent to the main PID (when used without -all) or to all PIDs in the process group when the -all suffix is used.

So if you just want to restart your daemon but keep all the children alive, you can just use “stop-mode: sigterm”.
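
For illustration, a daemon stanza combining the two options might look like the sketch below. The app name and command path are hypothetical; only the refresh-mode and stop-mode keys and their values come from the list above.

```yaml
# Sketch of a snap.yaml fragment using the options described in this post.
# "myapp" and "bin/myapp-daemon" are hypothetical placeholders.
apps:
    myapp:
        command: bin/myapp-daemon
        daemon: simple
        # Restart the daemon on refresh (the default for daemons).
        refresh-mode: restart
        # On stop, send SIGTERM to the main PID only; children are left running.
        stop-mode: sigterm
```

Switching to stop-mode: sigterm-all would instead deliver the signal to every PID in the process group, as described above.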

