Stop commands and snapd package cleanup

mborzecki · October 14, 2019, 10:38am

This topic was initially about extra mounts that were left behind by lxd tests in the integration test suite. While investigating, I arrived at the conclusion that the snapd Ubuntu/Debian packaging is somewhat at fault.

In the integration test error logs, I’ve observed some extra mounts under $SNAP_COMMON/ns. Those only appear when the daemon starts after a reboot or snap installation (simply run snap start lxd.daemon):

--- before      2019-10-14 09:42:35.443008793 +0000
+++ after       2019-10-14 09:42:19.971088103 +0000
@@ -31,3 +31,6 @@
 /snap/lxd/12181                 /dev/loop1             squashfs   ro,nodev,relatime
 /run/snapd/ns                   tmpfs[/snapd/ns]       tmpfs      rw,nosuid,noexec,relatime,size=149464k,mode=755
 /run/snapd/ns/lxd.mnt           nsfs[mnt:[4026532159]] nsfs       rw
+/var/snap/lxd/common/ns          tmpfs                  tmpfs      rw,relatime,size=1024k,mode=700
+/var/snap/lxd/common/ns/shmounts nsfs[mnt:[4026532160]] nsfs       rw
+/var/snap/lxd/common/ns/mntns    nsfs[mnt:[4026532159]] nsfs       rw
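
For anyone who wants to check for these leftovers without diffing the whole mount table, here is a minimal sketch. It assumes the standard /proc/self/mountinfo layout (field 5 is the mount point); the helper name mounts_under is made up for this example and is not part of snapd or LXD.

```shell
#!/bin/sh
# Hypothetical helper: print every mount point equal to or below a prefix.
# $1: prefix to look for, e.g. /var/snap/lxd/common/ns
# $2: mountinfo file to read (defaults to the live mount table)
mounts_under() {
    prefix=$1
    file=${2:-/proc/self/mountinfo}
    # Field 5 of mountinfo is the mount point. Match the prefix itself
    # or anything below it, but not siblings that merely share the prefix.
    awk -v p="$prefix" '$5 == p || index($5, p "/") == 1 { print $5 }' "$file" | sort
}
```

Running mounts_under /var/snap/lxd/common/ns before and after snap stop lxd.daemon should show whether the tmpfs and nsfs entries are still present.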

It is not required to have any containers running to observe this. The mounts go away when the lxd.daemon is stopped via snap stop. I’ve tracked the cleanup code to this place in LXD snap packaging:

https://github.com/lxc/lxd-pkg-snap/blob/latest-edge/snapcraft/commands/daemon.stop#L120-L121

if "${SNAP}/share/openvswitch/scripts/ovs-ctl" stop; then
    echo "==> Stopped Open vSwitch"

Things seem to work pretty well until one removes the snapd package. On Ubuntu/Debian the cleanup runs in the postrm hook and is triggered only by --purge. The snapd packaging attempts to clean up the system: it stops any snap services, unmounts the snap mounts and eventually removes the snap data. At this point our integration test suite fails with this error:

Removing snap lxd and revision 12100
+ [ -d /snap/bin ]
+ find /snap/bin -maxdepth 1 -lname lxd -delete
+ find /snap/bin -maxdepth 1 -lname lxd.* -delete
+ rm -f /snap/bin/lxd
+ rm -f /snap/bin/lxd.benchmark /snap/bin/lxd.buginfo /snap/bin/lxd.check-kernel /snap/bin/lxd.lxc /snap/bin/lxd.migrate
+ umount -d -l /snap/lxd/12100
+ true
+ rm -rf /snap/lxd/12100
+ rm -f /snap/lxd/current
+ rm -rf /var/snap/lxd/12100
+ rm -rf /var/snap/lxd/common
rm: cannot remove '/var/snap/lxd/common/ns/mntns': Device or resource busy
rm: cannot remove '/var/snap/lxd/common/ns/shmounts': Device or resource busy

I’ve debugged this a little further. In this scenario, the stop code of the LXD daemon runs without snapd. Stopping the service unit fails:

Oct 14 10:10:43 ubuntu lxd.daemon[3223]: => LXD is ready
Oct 14 10:10:56 ubuntu systemd[1]: Stopping Service for snap application lxd.daemon...
Oct 14 10:10:56 ubuntu systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=203
Oct 14 10:10:56 ubuntu systemd[1]: Stopped Service for snap application lxd.daemon.
Oct 14 10:10:56 ubuntu systemd[1]: snap.lxd.daemon.service: Unit entered failed state.
Oct 14 10:10:56 ubuntu systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.

The cleanup code is not reached, and mounts are left behind, cluttering the host namespace.

Digging further, the service is defined as:

[Service]
ExecStart=/usr/bin/snap run lxd.daemon
SyslogIdentifier=lxd.daemon
Restart=on-failure
WorkingDirectory=/var/snap/lxd/12181
ExecStop=/usr/bin/snap run --command=stop lxd.daemon
ExecReload=/usr/bin/snap run --command=reload lxd.daemon
TimeoutStopSec=600
Type=simple

By the time this runs in the deb postrm hook, there’s no /usr/bin/snap left to execute, since the package files have already been removed.
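
This matches the status=203 in the journal above: 203 is the exit status systemd reports when it cannot execute the command at all. A quick way to confirm this for a given unit is to check whether the binaries named in its Exec* lines still exist. The following is only a sketch; missing_exec_binaries is a name invented for this example, and the unit path in the usage note is an assumption about where snapd writes its units.

```shell
#!/bin/sh
# Hypothetical helper: list Exec* directives in a unit file whose
# executable is missing or not executable.
# $1: path to a systemd unit file
missing_exec_binaries() {
    unit_file=$1
    # Keep the directive name and the first word of the command line,
    # dropping systemd's optional prefix characters (-, @, :, +, !).
    sed -n 's/^\(Exec[A-Za-z]*\)=[-@:+!]*\([^ ]*\).*/\1 \2/p' "$unit_file" |
    while read -r directive binary; do
        [ -x "$binary" ] || echo "$directive: $binary"
    done
}
```

For example, missing_exec_binaries /etc/systemd/system/snap.lxd.daemon.service would be expected to list ExecStart, ExecStop and ExecReload once /usr/bin/snap has been removed.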

This indicates there’s a problem with snapd packaging and how that affects the snap services. I think we should stop and disable the service units in prerm. This is already done by a call to snap-mgmt --purge in Fedora/Arch/openSUSE packaging.

mborzecki · October 14, 2019, 1:32pm

As discussed after the standup, we’ll look into stopping and disabling services, timers and sockets in the prerm hook. This will keep the snap data around until postrm (--purge) runs, but the services and whatever triggers they may have should otherwise be inactive.
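
To make the idea concrete, here is a rough sketch of what such a prerm step could look like. It is not the actual snapd packaging code. It assumes that snapd-generated units live in /etc/systemd/system and are named snap.<snap>.<app>.service, .timer or .socket; the function names and the RUN variable are invented for this example.

```shell
#!/bin/sh
# Hypothetical prerm fragment: stop and disable snap units while
# /usr/bin/snap is still installed, so that ExecStop can actually run.

# RUN is a made-up knob for dry runs: set RUN=echo to print the
# systemctl commands instead of executing them.
RUN=${RUN:-}

# Print the names of snap-generated units found in a directory.
# Timers and sockets come first so they cannot re-activate a service
# that has just been stopped.
snap_units() {
    dir=${1:-/etc/systemd/system}
    for f in "$dir"/snap.*.timer "$dir"/snap.*.socket "$dir"/snap.*.service; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$f" ] && basename "$f"
    done
    return 0
}

# Stop and disable each unit given as an argument; a failure on one
# unit does not abort the rest.
stop_and_disable() {
    for unit in "$@"; do
        $RUN systemctl disable --now "$unit" || true
    done
}
```

A prerm script could then call stop_and_disable $(snap_units), leaving the removal of mounts and snap data to postrm as before.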