To follow up on an email from the snapcraft mailing list, I am wondering why, on snap refresh, an existing running snap is terminated. The original problem is that if a running process is still doing ‘something’ (the example given is a QEMU instance, but it could really be anything), it gets terminated regardless, one way or another. The snapd code tackles a refresh by:
SIGTERM all things, and SIGKILL those that have not stopped after a timeout, on update
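That sequence amounts to roughly the following sketch (illustrative shell only, not snapd’s actual implementation; the 5-second grace period is an arbitrary example):

```shell
#!/bin/sh
# Sketch of the SIGTERM-then-SIGKILL stop sequence described above.
# Illustrative only; the grace period length is an arbitrary example.
stop_with_grace() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null              # ask politely first
    for _ in 1 2 3 4 5; do                     # grace period, 1s steps
        kill -0 "$pid" 2>/dev/null || return 0 # exited on its own
        sleep 1
    done
    kill -KILL "$pid" 2>/dev/null              # timeout expired: force it
}

sleep 60 &                                     # stand-in for a snap service
demo_pid=$!
stop_with_grace "$demo_pid"
wait "$demo_pid" 2>/dev/null || true
echo "service stopped"
```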
Is this necessary, or could we keep the old process around until it naturally comes to a halt? We keep up to 3 revisions of each snap, so in theory everything is still around. The alternative offered in the mail is:
PRE: content in /snap/app/oldver/foo
UPGRADE adds: /snap/app/newver/foo
UPGRADE changes: /snap/app/current is set to newver
But /snap/app/oldver/foo would stay around, and running applications would be kept running.
Only once the last one is gone would /snap/app/oldver completely vanish.
That last step would only happen if we have more than 3 revisions on the file system; otherwise oldver would persist.
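In shell terms, the proposed refresh would look roughly like this (the paths and version names are illustrative stand-ins for /snap/app; a real refresh goes through snapd, not ad-hoc shell):

```shell
#!/bin/sh
# Sketch of the proposed refresh: add the new revision alongside the old
# one and flip the "current" symlink, leaving the old tree untouched.
base=$(mktemp -d)                # stand-in for /snap/app
mkdir -p "$base/oldver" "$base/newver"

echo v1 > "$base/oldver/foo"     # PRE: old revision content
ln -s oldver "$base/current"

echo v2 > "$base/newver/foo"     # UPGRADE adds the new revision
ln -sfn newver "$base/current"   # UPGRADE points "current" at newver

cat "$base/current/foo"          # new opens resolve to v2 ...
cat "$base/oldver/foo"           # ... while the old tree stays intact
```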
I think that this kind of problem affects daemons that can manage other processes via versioned APIs. In this case, libvirt doesn’t care which QEMU version it talks to - it will connect to a socket and find out what it needs via the QEMU Machine Protocol. Libvirt can be updated separately and may use a new squashfs while the old QEMU processes still use the old one.
Not every application is resilient to termination at a random time, so even if we are not talking about virtual machines or containers in particular, there are at least two issues with regards to abrupt termination:
Generic in-memory state is lost - you don’t know if an application was in the middle of some processing activity;
In-memory libc (or other core runtime) state is lost: in particular, the standard C library buffers which have not been flushed.
My impression is that we need to generalize the approach to updates though this will complicate the snapd implementation.
What’s on my mind is:
Allowing certain processes belonging to a snap services’ cgroups to keep old squashfs file systems mounted until they terminate themselves;
Adding filtering capabilities (e.g. certain /proc/[pid]/cmdline patterns) for defining logical process groups;
Update scheduling, e.g. only update a day after or a week after a new version is available;
Not all applications shut down gracefully in response to SIGTERM: we need a mechanism that would allow sending a different signal, or running a command, to gracefully shut down an application on update.
How about a refresh hook? The hook could, if defined, inform snapd about a special upgrade preference:
just restart services automatically; this is also the default if no hook is provided
opportunistically restart services (i.e. let them know we want to refresh) but don’t kill them; this could allow VM software to migrate services to other hosts and coordinate amongst a cluster.
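As a sketch, such a hook could be a small script shipped by the snap. Everything below is hypothetical: the guest check, the migration step, and the status semantics are assumptions for illustration, not current snapd behaviour:

```shell
#!/bin/sh
# Hypothetical pre-refresh hook for a VM-hosting snap. The process-name
# check, the migration placeholder and the "postpone" semantics are all
# made up for illustration.
guests_running() {
    pgrep -x qemu-system-x86 >/dev/null 2>&1   # crude check, illustrative
}

if guests_running; then
    hook_status="postpone"
    echo "guests still running: request migration, postpone refresh"
    # virsh migrate ... (placeholder for cluster coordination)
else
    hook_status="proceed"
    echo "no guests running: safe to refresh"
fi
```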
I recall we had some discussion on supporting SIGHUP-style refreshes. Imagine you have an application that can gracefully restart itself with zero downtime observed from the outside (FD passing and tricks like that). As a snap, you’d like to be able to refresh it while maintaining that property.
The tricky part in all the cases is recycling older revisions (and, to some degree, the fact that interfaces are not per revision but global). There is also some intersection between this feature, snap-update-ns, and the case when the core snap is itself updated. To see a new core snap on a classic system you have to restart your machine. We could fix that easily by discarding all preserved namespaces after a core refresh and forcing all snaps to restart, but this is tricky as we don’t know if any long-running command exists (one that is not a service).
If we replace StopSnapServices with a custom action that says ‘terminate libvirt, don’t touch qemu’ it should be fine so long as the new libvirt instance knows where to find old QEMU unix sockets and the old squashfs file systems are not touched so that QEMU processes use the old files.
The problem is that we would allow a third-party snap to execute custom code with snapd’s privileges on update - doesn’t seem right (though I am sure there is a workaround for that).
Also, I’d look at how this maps onto the behavior of other applications, but so far our concern has been process lifetime and inodes.
The process is terminated because the filesystem under the application is changing in several ways. If we choose to leave processes running on old executable code, old data, old filesystems, old revisions, and old configurations, the amount of complexity one needs to understand for every trivial update will be overwhelming, and the outcome apparently very wrong: when people update, in general they want to see the new version running.
So instead of spending time thinking about how to not restart processes, it seems wiser to spend time mitigating the effects of actually restarting the processes.
I realize that it adds a lot of complexity, but not doing something like this means that a whole class of applications is going to be hard to adapt to snapd’s behavior. Or, in other words, nobody is going to use snaps for production use-cases with those applications.
Libvirt has had support for restarting the daemon without restarting VMs for quite a while. Again, forcefully restarting a VM will lead to file system corruption in addition to the lost application state - no user will tolerate this. You can’t even predict whether it is going to come up afterwards.
A restart of LXD does not lead to all containers being restarted (unless packaged as a snap).
Pretty much anything that requires an SLA will be affected: if you have no control over process lifetime, you cannot guarantee anything.
There may be critical updates requiring a timely response but not every update requires immediate uncontrollable action.
I will give a rather philosophical statement here: applications are generic and can be complex, and the decision of whether to update or not requires some analysis. This analysis can be done by a human, which may be more thoughtful but does not scale well and may not be timely, or it can be done automatically. Automatic updates require some hard-coded logic, and we currently have the simplest possible form of it: kill & restart. Ideally, to replace a human we would need an AI to do this job (neural networks, say, are appealing because they make decisions in a constant amount of time after training, but they have their limitations and are not suitable here). Realistically, there is usually something in the middle: a rule-based description of what to update, when, and how.
In my view, trying to use one simple approach for every possible application out there is wrong.
An extreme example: a real-time system that uses in-memory state to adjust parameters of medical equipment (life support). It can be restarted and may even start operating properly but you simply don’t know if a temporary service interruption is going to affect the system in a dangerous way.
The bottom line is that I would give workarounds for such cases, otherwise people will stick to traditional packaging methods for some applications.
That’s also common and good practice. If we ask for something to stop politely, and it refuses to stop politely after a grace period, we need to kill it rather than have a system completely out of control because the snap or the software was poorly written. That said, we should allow the snap to have a say about what a reasonable grace period is for its context, up to a proper limit.
For the case you describe, with VMs that should remain alive independently of the control daemon, we’ll need to exchange some more details so we can try to come up with a good design. It’ll probably involve separating out the concepts into pieces which are independently and predictably refreshed, but we need to dig into details.
For that reason systemd considers it a part of a service.
But having QEMU processes in the same ‘systemd’ cgroup is fine - with debs it is done the same way.
After looking at the unit definition for a regular libvirt deb package, I can see that KillMode=process is used, which means that only the main process of the service is killed. The systemd unit for lxd uses the same option.
```shell
$ sudo systemctl cat libvirt-bin | grep KillMode
KillMode=process
$ systemctl status libvirtd.service | grep Main
 Main PID: 3474 (libvirtd)
```
Per the rationale above, we can’t simply allow processes that are using files in an outdated and removed snap revision to go on, since the data under them has changed and this ends up in tears more often than not.
One idea I’d like to explore is splitting up the definition into two different snaps: one of them would be responsible for the runtime, and the other would be responsible for the actual VMs. This way, you can update the runtime as much as you want without stopping the VMs, but at the same time we would still have a good handle on being able to stop the VMs and even removing/resetting them whenever necessary via a convenient and well known mechanism.
I thought about splitting this into two snaps initially: libvirt and qemu. But even if I could implement some mechanism to have QEMU processes start in a different service (cgroup), it would still be pretty much the same: if you update QEMU, the VMs go away.
A runtime option (if I understand it right) is interesting, but I don’t see how to do it yet without changing the application too much (which is a no-go in the case of libvirt or docker, while it could be done with lxd).
In my view, we would need to replicate some part of the traditional approach for this. Like I commented above, the way traditional packaging works is that you unlink the old files and just use the new ones. If a process still uses an old inode, that is not a problem (we are unlinking, not writing over):
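A quick shell demonstration of that (Linux-specific; the temporary file and the descriptor number are arbitrary):

```shell
#!/bin/sh
# Open a file, unlink it, and show the content is still reachable through
# the held descriptor via /proc. Linux-specific (/proc/[pid]/fd).
f=$(mktemp)
echo "old content" > "$f"
exec 3< "$f"                   # keep a descriptor on the old inode
rm "$f"                        # unlink: the path is gone, the inode is not
content=$(cat /proc/$$/fd/3)   # ...but the fd still reads the old data
echo "$content"
exec 3<&-
```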
So, via procfs, I can still access the old inode and its contents, while if a different application, or an updated instance of the same application, uses this path to get a file descriptor, it will get one for a different inode.
Even though these entries at /proc/[pid]/fd are symlinks, they are “special”: even if they point nowhere (the target is deleted), you can still open them (preventing the inode from being freed) and still read/write data. This is why cat works.
The same mechanism works for mmap(2): before shared objects are mapped into the virtual address space of a process, these files need to be opened:
```c
void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);
```
For QEMU, which has a lot of dependencies in the form of shared libraries, this is very important. Basically, you have to keep the old binary and the old shared libraries around in order to keep VMs intact.
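The same effect is easy to see for a running executable; a small Linux-only demonstration, using a copy of `sleep` as a stand-in for a long-lived QEMU process:

```shell
#!/bin/sh
# A running process keeps its unlinked (mmap'ed) executable alive: copy a
# binary, start it, delete it, and observe the process keeps running.
# Linux-specific; "fake-qemu" is just a renamed copy of sleep(1).
tmp=$(mktemp -d)
cp "$(command -v sleep)" "$tmp/fake-qemu"
"$tmp/fake-qemu" 60 &
pid=$!
sleep 1                                   # let the child finish exec'ing
rm "$tmp/fake-qemu"                       # unlink the binary while in use
exe_target=$(readlink "/proc/$pid/exe")   # target is marked "(deleted)"
echo "$exe_target"
running=no
kill -0 "$pid" 2>/dev/null && running=yes
echo "still running: $running"
kill "$pid" 2>/dev/null
```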
QEMU does not use ANY config files - everything is passed via CLI options (by libvirt, or by you if you launch it manually), so, technically, we don’t have any external state for QEMU in the form of files, except for shared libraries. Libvirt has XML domain configs and other stuff, but we don’t really care, as libvirt and QEMU talk like client and server via a unix socket.
I would generalize the problem as: how do we snap client-server applications where clients and servers communicate via a unix socket as a medium (or any other POSIX IPC mechanism)?
To answer that question, I would think about the idea of a controlled lifetime for snaps, and of transferring a process to a separate ‘group’ which has certain lifetime guarantees from snapd. Right now this is all automatic.
Stephane also has some requirements for differentiating the shutdown and uninstall events.
With the above behavior in mind does anyone have suggestions on how to implement the runtime idea?
That’s still arguing about how to keep applications running when everything under them is significantly changing. Yes, we can make it work, in the sense of the application not crashing immediately, but very often this leads to serious bugs, non-determinism, and bad user experience – a system which is all over the place in non-obvious ways.
So, can we please spend some time on playing with the idea of how to map these needs into a world where each snap may restart its processes on updates, but different snaps may keep the workloads running?
We’re conflating two different things here, I think. Restarting a daemon without killing everything is of course fine and a good idea. It’s the refresh case, which has a number of associated displacements, where it is problematic not to have a complete halt while it takes place, as far as the processes depending on the old state are concerned.
In fact, the Docker documentation agrees with me, right there on that page:
If you skip releases during an upgrade, the daemon may not restore its connection to the containers. If the daemon is unable to restore the connection, it ignores the running containers and you must manage them manually.
If you kill -9 PostgreSQL rather than waiting for it to shut down gracefully, the database will go into recovery mode, which could take it out of operation for a long period. PostgreSQL is well-written software, and as such takes great care to shut down gracefully and minimize downtime. On larger deployments this can take some time.
Can the SIGKILL behaviour be made configurable? Terminating PostgreSQL with SIGKILL is against best practice and unacceptable in most production deployments, where it is better to leave the system running until the problem can be dealt with manually. The recommended systemd service files for PostgreSQL disable SIGKILL for this reason.
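For reference, the kind of unit settings the PostgreSQL documentation suggests looks along these lines (a hedged fragment, not a complete unit; paths and timeout values vary by distribution):

```ini
[Service]
Type=notify
ExecStart=/usr/lib/postgresql/bin/postgres -D /var/lib/postgresql/data
# "Fast shutdown": SIGINT aborts active transactions but lets the server
# shut down cleanly, instead of being SIGKILLed mid-write.
KillSignal=SIGINT
KillMode=mixed
TimeoutSec=infinity
```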
This sort of policy is now generally encoded in systemd service files, so maybe snapd should expose more systemd configuration options and rely on systemd to restart things… eventually, maybe even after the system is rebooted. As a user and sysadmin, I would prefer snapd to be blocked rather than have it overrule my policies and terminate my user-facing services.