Process lifecycle on snap refresh

To follow up on an email from the snapcraft mailing list: I am wondering why, on snap refresh, an existing running snap is terminated. The original problem is that if a running process is still doing ‘something’ (the example given is a QEMU instance, but it could really be anything), it gets terminated regardless, one way or another. The snapd code tackles a refresh by:

SIGTERM all processes, and SIGKILL any that have not stopped after a timeout, on update

Is this necessary, or could we keep the old process around until it naturally comes to a halt? We keep up to 3 revisions of each snap, so in theory everything is still around. The alternative offered in the mail is:

PRE: content in /snap/app/oldver/foo
UPGRADE adds: /snap/app/newver/foo
UPGRADE changes: /snap/app/current is set to newver
But /snap/app/oldver/foo would stay around and running applications would be kept alive.
Only once the last one is gone would /snap/app/oldver completely vanish.

The last step would only happen if we have more than 3 revisions on the file system; otherwise oldver would persist.
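
For reference, this is roughly what the layout already looks like today (a sketch - ‘app’ and the revision numbers are made up): each revision is a separate squashfs mounted under its own directory, and current is just a symlink that the refresh flips.

ls /snap/app
# 1001  1002  current
readlink /snap/app/current
# 1002  (was 1001 before the refresh)
mount | grep '/snap/app/'
# /var/lib/snapd/snaps/app_1001.snap on /snap/app/1001 type squashfs (ro,...)
# /var/lib/snapd/snaps/app_1002.snap on /snap/app/1002 type squashfs (ro,...)

So the proposal above amounts to keeping the old mount around for as long as processes started from it are alive, instead of garbage-collecting it purely by revision count.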

Would this work?

If all libraries for foo still refer to oldver, it should. This gets tricky with daemons, though.

It seems it would be a backward-incompatible change, especially for daemons. We would need to think about how to let snaps have an opinion on what should happen at refresh.

I think this kind of problem affects daemons that can manage other processes via versioned APIs. In this case, libvirt doesn’t care which QEMU version it talks to - it connects to a socket and finds out what it needs via the QEMU Machine Protocol. Libvirt can be updated separately and may use a new squashfs while the old QEMU processes still use the old one.

Not every application is resilient to termination at a random time, so even if we are not talking about virtual machines or containers in particular, there are at least two issues with regard to abrupt termination:

  1. Generic in-memory state is lost - you don’t know if an application was in the middle of some processing activity;
  2. In-memory libc (or other core runtime) state is lost: in particular, standard C library buffers that have not been flushed (see the sketch after this list).
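
A minimal sketch of point 2, assuming python3 is on the host just to get ordinary stdio block buffering:

python3 -c 'print("state I meant to write"); import time; time.sleep(60)' > /tmp/out &
sleep 1            # give it time to run print(); the line now sits in the stdio buffer
kill -9 $!         # SIGKILL cannot be caught, so nothing gets flushed
cat /tmp/out       # empty

A process that handles SIGTERM and is given a grace period can flush and exit cleanly; one that is SIGKILL-ed never can.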

My impression is that we need to generalize the approach to updates though this will complicate the snapd implementation.

What’s on my mind is:

  1. Allowing certain processes belonging to a snap’s service cgroups to keep old squashfs file systems mounted until they terminate themselves;

  2. Adding filtering capabilities (e.g. certain /proc/[pid]/cmdline patterns) for defining logical process groups;

  3. Update scheduling, e.g. only update a day after or a week after a new version is available;

  4. Not all applications shut down gracefully in response to SIGTERM: we need a mechanism that would allow sending a different signal, or running a command, to gracefully shut down an application on update (see the sketch after this list).
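
For point 4, systemd already has generic knobs for this. A sketch of what they look like at the unit level (the stop command path here is made up), independent of how snapd might eventually expose them:

[Service]
# send this instead of SIGTERM when stopping
KillSignal=SIGUSR1
# or run a command instead of signalling (path is hypothetical)
ExecStop=/usr/bin/graceful-shutdown
# grace period before escalating to SIGKILL
TimeoutStopSec=5min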

How about a refresh hook? The hook could, if defined, inform snapd about a special upgrade preference:

  • just restart services automatically; this is also the default if no hook is provided
  • opportunistically restart services (i.e. let them know we want to refresh) but don’t kill them; this could allow VM software to migrate services to other hosts and coordinate amongst a cluster (see the sketch after this list).
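
A sketch of what such a hook might look like; purely hypothetical - neither a ‘refresh’ hook nor a ‘refresh-mode’ snapctl key nor my-daemon-ctl exist, this is just to make the idea concrete. The hook would ship as an executable inside the snap and run confined like any other hook:

#!/bin/sh
# hypothetical meta/hooks/refresh
set -e
# start draining: migrate VMs to other hosts, checkpoint state, etc.
my-daemon-ctl start-migration
# tell snapd we prefer an opportunistic restart rather than an immediate stop
snapctl set refresh-mode=opportunistic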

I recall we had some discussion on supporting SIGHUP-style refreshes. Imagine you have an application that can gracefully restart itself with zero downtime observed from the outside (FD passing and tricks like that). As a snap, you’d like to be able to refresh it while maintaining that property.

The tricky part in all these cases is recycling older revisions (and, to some degree, the fact that interfaces are not per revision but global). There is also some intersection between this feature, snap-update-ns, and the case where the core snap is itself updated. To see a new core snap on a classic system you have to restart your machine. We could fix that by discarding all preserved namespaces after a core refresh and forcing all snaps to restart, but this is tricky as we don’t know whether any long-running command (that is not a service) exists.

It might work.

If we replace StopSnapServices with a custom action that says ‘terminate libvirt, don’t touch qemu’ it should be fine so long as the new libvirt instance knows where to find old QEMU unix sockets and the old squashfs file systems are not touched so that QEMU processes use the old files.

The problem is that we would allow a third-party snap to execute custom code with snapd’s privileges on update - doesn’t seem right (though I am sure there is a workaround for that).

Also, I’d look at how this maps onto the behavior of other applications, but so far our concern has been process lifetime and inodes.

Not quite, all the hooks run confined :slight_smile:

The process is terminated because the filesystem under the application is changing in several ways. If we choose to leave processes running on old executable code, old data, old filesystems, old revisions, and old configurations, the amount of complexity one needs to understand for every trivial update will be overwhelming, and the outcome apparently very wrong: when people update, in general they want to see the new version running.

So instead of spending time thinking about how to not restart processes, it seems wiser to spend time mitigating the effects of actually restarting the processes.

I realize that it adds a lot of complexity, but not doing something like this means that a whole class of applications will be hard to adapt to snapd’s behavior. Or, in other words, nobody is going to use snaps for production use cases with those applications.

Libvirt has had support for restarting the daemon without restarting VMs for quite a while. Again, forcefully restarting a VM will lead to file system corruption in addition to the lost application state - no user will tolerate this. You can’t even predict whether it is going to come up afterwards.

A restart of LXD does not lead to all containers being restarted (unless packaged as a snap).

Pretty much anything that requires an SLA will be affected: if you have no control over process lifetime you cannot guarantee anything.

There may be critical updates requiring a timely response but not every update requires immediate uncontrollable action.

I will give a rather philosophical statement here: applications are generic and can be complex, and deciding whether or not to apply an update requires some analysis. This analysis can be done by a human, which might be more thoughtful but does not scale well and may not be timely, or it can be done automatically. Automatic updates require some hard-coded logic: we currently have the simplest possible form of it, kill & restart. Ideally, to replace a human we would need an AI to do this job (neural networks, say, are appealing because they make decisions in a constant amount of time after learning, but they have their limitations and are not suitable here). Realistically, there is usually something in the middle: a rule-based description of what to update, when, and how.

In my view, trying to use one simple approach for every possible application out there is wrong.

An extreme example: a real-time system that uses in-memory state to adjust parameters of medical equipment (life support). It can be restarted and may even start operating properly but you simply don’t know if a temporary service interruption is going to affect the system in a dangerous way.

The bottom line is that I would provide workarounds for such cases; otherwise people will stick to traditional packaging methods for some applications.

I think you heard the part about restarting applications without reading the part about mitigating problems.

We also don’t kill & restart applications. We ask them to stop, giving them a choice of how to do so, which is common practice.

We need to be a bit less fearful of discussing ideas for improving what we have today, without assuming a bad design upfront. The bad design is bad. Let’s look for the good one.

Sorry if my message was too negative :^)

I am not saying the current design is bad: rather, it is not universal in its current form.

I can think of Docker as a good example: it requires an application to be ‘cloud-native’. Without ten-dollar words, they say: “rewrite your application to work with our sandboxing mechanism”.

We do at first, but then after a certain timeout the processes are SIGTERM-ed, and SIGKILL-ed after a second timeout, as far as I can see.

That’s also common and good practice. If we ask for something to stop politely, and it refuses to stop politely after a grace period, we need to kill it rather than have a system completely out of control because the snap or the software was poorly written. That said, we should allow the snap to have a say about what a reasonable grace period is for its context, up to a proper limit.
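
In snap terms, letting the snap have that say could look roughly like this in snapcraft.yaml. This is a sketch only - check the current documentation for which keys (and which upper limit on the timeout) are actually supported, and bin/graceful-stop is made up:

apps:
  libvirt-bin:
    daemon: simple
    # grace period before snapd/systemd escalates
    stop-timeout: 5m
    # command to run instead of sending SIGTERM
    stop-command: bin/graceful-stop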

For the case you describe, with VMs that should remain alive independently of the control daemon, we’ll need to exchange some more details so we can try to come up with a good design. It’ll probably involve separating out the concepts into pieces which are independently and predictably refreshed, but we need to dig into details.

The idea is quite simple: as a daemon, libvirt creates QEMU processes, which get daemonized themselves. The app is defined in snapcraft.yaml as a simple daemon:

apps:
  libvirt-bin:
    daemon: simple

The parent pid is 1, which means that the QEMU process was reparented to init via the standard daemonization procedure:

ps -o ppid `pgrep -f qemu`
 PPID
    1

It does, however, belong to one of the cgroup v1 hierarchies (in particular, the one that systemd uses to group service-related processes). The hierarchy in question:

grep -P 'cgroup.*?systemd' /proc/mounts
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd 0 0

The pid of the qemu process is present in the list of tasks for the service’s cgroup:

grep `pgrep -f qemu` /sys/fs/cgroup/systemd/system.slice/snap.libvirt.libvirt-bin.service/tasks
16527

For that reason systemd considers it a part of a service.

But having QEMU processes in the same ‘systemd’ cgroup is fine - with debs it is done the same way.

After looking at the unit definition for the regular libvirt deb package, I can see that KillMode=process is used, which means that only the main process of the service is killed. The systemd unit for lxd uses the same option.

sudo systemctl cat libvirt-bin | grep KillMode
KillMode=process

systemctl status libvirtd.service | grep Main
 Main PID: 3474 (libvirtd)

https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=

The definition I get for a snapcraft-generated systemd unit file does not have that option:

systemctl cat snap.libvirt.libvirt-bin.service 
# /etc/systemd/system/snap.libvirt.libvirt-bin.service
[Unit]
# Auto-generated, DO NOT EDIT
Description=Service for snap application libvirt.libvirt-bin
Requires=snap-libvirt-x1.mount
Wants=network-online.target
After=snap-libvirt-x1.mount network-online.target
X-Snappy=yes
[Service]
ExecStart=/usr/bin/snap run libvirt.libvirt-bin
Restart=on-failure
WorkingDirectory=/var/snap/libvirt/x1
TimeoutStopSec=30
Type=simple
[Install]
WantedBy=multi-user.target

So it appears that this particular problem could be resolved by allowing modifications to the generated systemd service definitions.
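
For example, as a manual experiment (not something snapd manages - the drop-in may not survive future changes), a systemd drop-in can add the missing option to the generated unit:

sudo mkdir -p /etc/systemd/system/snap.libvirt.libvirt-bin.service.d
sudo tee /etc/systemd/system/snap.libvirt.libvirt-bin.service.d/override.conf <<'EOF'
[Service]
KillMode=process
EOF
sudo systemctl daemon-reload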

Found a similar request by ‘stub’:

Per the rationale above, we can’t simply allow processes that are using files in an outdated and removed snap revision to go on, since the data under them has changed and this ends up in tears more often than not.

One idea I’d like to explore is splitting up the definition into two different snaps: one of them would be responsible for the runtime, and the other would be responsible for the actual VMs. This way, you can update the runtime as much as you want without stopping the VMs, but at the same time we would still have a good handle on being able to stop the VMs and even removing/resetting them whenever necessary via a convenient and well known mechanism.

Shall we explore that idea further?

I thought about splitting this into two snaps initially: libvirt and qemu. But even if I could implement some mechanism to have QEMU processes start in a different service (cgroup), it would still be pretty much the same: if you update QEMU, the VMs go away.

A runtime option (if I understand it right) is interesting but I don’t see how to do it yet without changing the application too much (which is a no-go in case of libvirt or docker while it could be done with lxd).

In my view, we would need to replicate some part of the traditional approach for this. As I commented above, the way traditional packaging works is that you unlink the old files and just use the new ones. If a process still uses an old inode, it will not be a problem (we are unlinking, not writing over):

A demo:

➜  /tmp echo 'oldstuff' > testfile
➜  /tmp ipython

In [1]: f = open('testfile', 'r')
[1]  + 31427 suspended  ipython

➜  /tmp ls -l /proc/`pgrep -f ipython`/fd/ | grep tmp
lr-x------ 1 dima dima 64 апр 26 17:21 11 -> /tmp/testfile

➜  /tmp echo 'newstuff' > newfile 
➜  /tmp mv newfile testfile 

➜  /tmp ls -l /proc/`pgrep -f ipython`/fd/ | grep tmp
lr-x------ 1 dima dima 64 апр 26 17:21 11 -> /tmp/testfile (deleted)

➜  /tmp cat testfile 
newstuff
➜  /tmp cat /proc/`pgrep -f ipython`/fd/11
oldstuff

So, via procfs I can still access the old inode and its contents, while if a different application (or an updated instance of the same application) opens this path, it will get a file descriptor for a different inode.

Even though these entries at /proc/[pid]/fd look like symlinks, they are “special”: even if they point to a deleted path, you can still open them (preventing the inode from being freed) and still read/write data. This is why cat works.

The same mechanism works for mmap(2): before shared objects are mmaped into a virtual address space of a process, these files need to be opened:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

If you don’t unlink a file but overwrite it in place, then the program will be corrupted: when it page-faults on whatever was not in memory at that point, it will get the contents of the new file:
https://www.tablix.org/~avian/blog/archives/2008/03/linux_mmap_weirdness/

For QEMU, which has a lot of dependencies in the form of shared libraries, this is very important. Basically, you have to keep the old binary and the old shared libraries in order to keep VMs intact.
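
You can see this on a live system: once the files backing a running QEMU have been replaced or unmounted, the old objects it still maps show up as deleted (a sketch; assumes a qemu process is running and its backing files have already been replaced):

pid=$(pgrep -f qemu | head -n1)
# mappings whose backing file has been unlinked are marked "(deleted)";
# the process keeps running on those old inodes until it exits
grep -c ' (deleted)' /proc/$pid/maps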

QEMU does not use ANY config files - everything is passed via CLI options (by libvirt, or by you if you launch it manually), so, technically, we don’t have any external state for QEMU in the form of files except for shared libraries. Libvirt has XML domain configs and other stuff, but we don’t really care, as libvirt and QEMU talk like a client and server via a unix socket.

I would generalize the problem as: how do we snap client-server applications where clients and servers communicate via a unix socket as a medium (or any other POSIX IPC mechanism)?

To answer that question I would think about the idea of controlled lifetime for snaps and transferring a process to a separate ‘group’ with certain lifetime guarantees from snapd. Right now this is all automatic.

Stephane also has some requirements for differentiating the shutdown or uninstall events:

With the above behavior in mind does anyone have suggestions on how to implement the runtime idea?

That’s still arguing about how to keep applications running when everything under them is significantly changing. Yes, we can make it work, in the sense of the application not crashing immediately, but very often this leads to serious bugs, non-determinism, and bad user experience – a system which is all over the place in non-obvious ways.

So, can we please spend some time on playing with the idea of how to map these needs into a world where each snap may restart its processes on updates, but different snaps may keep the workloads running?

I will give it some thought.

I just keep bumping into scenarios that make it hard to do - so it would be helpful for other people to help architect this.

Let’s forget about qemu and libvirt for a minute. Say, we are talking about lxd or docker which face the same issue.

LXD has a daemon process and monitor processes:

lxd       3175     1  3174  3174  0 апр24 ?     00:00:01   dnsmasq --strict-order --bind-interfaces --pid-file=/var/lib/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --interface=lxdbr0 --listen-address=10.122.52.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/lib/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/lib/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.122.52.2,10.122.52.254,1h -s lxd -S /lxd/ -u lxd
root      3216     1  3216  3216  0 апр24 ?     00:00:00   [lxc monitor] /var/lib/lxd/containers juju-58445f-0
165536    3248  3216  3248  3248  0 апр24 ?     00:00:01     /sbin/init
165536    3362  3248  3362  3362  0 апр24 ?     00:00:00       /lib/systemd/systemd-journald
165536    3377  3248  3377  3377  0 апр24 ?     00:00:00       /lib/systemd/systemd-udevd

You can restart a daemon without touching monitor processes.
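
A quick way to check that with the deb packaging (which uses KillMode=process, as noted above; assumes the deb’s lxd.service):

pgrep -f 'lxc monitor'        # note the monitor pids
sudo systemctl restart lxd
pgrep -f 'lxc monitor'        # same pids: the containers were not restarted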

And this is the same package - so I cannot just take those out to a separate snap. There is probably not even a separate binary for that.

Same with docker:

We’re conflating two different things here, I think. Restarting a daemon without killing everything is of course fine and a good idea. It’s the refresh case, with the number of displacements associated with it, that is problematic to carry out without a complete halt, as far as the processes depending on the old state are concerned.

In fact, the Docker documentation agrees with me, right there on that page:

If you skip releases during an upgrade, the daemon may not restore its connection to the containers. If the daemon is unable to restore the connection, it ignores the running containers and you must manage them manually.

If you kill -9 PostgreSQL, rather than wait for it to shut down gracefully, the database will go into recovery mode, which could take it out of operation for a long period. PostgreSQL is well-written software, and as such takes great care to shut down gracefully and minimize downtime. On larger deployments this can take some time.

Can SIGKILL behaviour be made configurable? Terminating PostgreSQL with SIGKILL is against best practice and unacceptable in most production deployments, where it is better to leave the system running until the problem can be manually dealt with. The recommended systemd service files for PostgreSQL disable using SIGKILL for this reason.

This sort of policy is now generally encoded in systemd service files, so maybe snapd should expose more systemd configuration options and rely on systemd to restart… eventually, maybe even after the system is rebooted. As a user and sysadmin, I would prefer snapd to be blocked than have it overrule my policies and terminate my user facing services.

But under classic confinement this is not a problem, right?