This topic was initially about extra mounts left behind by LXD tests in the integration test suite. While investigating, I came to the conclusion that the snapd Ubuntu/Debian packaging is somewhat at fault.
In the integration test error logs, I’ve observed some extra mounts under $SNAP_COMMON/ns. Those only appear when the daemon starts after a reboot or snap installation (simply run snap start lxd.daemon):
--- before 2019-10-14 09:42:35.443008793 +0000
+++ after 2019-10-14 09:42:19.971088103 +0000
@@ -31,3 +31,6 @@
/snap/lxd/12181 /dev/loop1 squashfs ro,nodev,relatime
/run/snapd/ns tmpfs[/snapd/ns] tmpfs rw,nosuid,noexec,relatime,size=149464k,mode=755
/run/snapd/ns/lxd.mnt nsfs[mnt:[4026532159]] nsfs rw
+/var/snap/lxd/common/ns tmpfs tmpfs rw,relatime,size=1024k,mode=700
+/var/snap/lxd/common/ns/shmounts nsfs[mnt:[4026532160]] nsfs rw
+/var/snap/lxd/common/ns/mntns nsfs[mnt:[4026532159]] nsfs rw
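A comparison like the one above can be reproduced by hand. The following is only a minimal sketch: the helper name new_mounts is made up for illustration, and the findmnt invocation in the comments is an assumption about how one might capture the mount table, not the tool the test suite actually uses.

```shell
# Print the lines that appear in the "after" snapshot ($2) but not in the
# "before" snapshot ($1). Lines are compared as whole, fixed strings.
new_mounts() {
    grep -Fxv -f "$1" "$2"
}

# Hypothetical usage on a real system (findmnt comes from util-linux):
#   findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > before
#   snap start lxd.daemon
#   findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > after
#   new_mounts before after
```

Unlike a unified diff, this only reports additions, which is enough to spot mounts that a service start leaves in the host namespace.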
It is not required to have any containers running to observe this. The mounts go away when lxd.daemon is stopped via snap stop. I’ve tracked the cleanup code to this place in LXD snap packaging:
Things seem to work pretty well until one removes the snapd package. On Ubuntu/Debian the cleanup runs in the postrm hook and is triggered only by --purge. The snapd packaging attempts to clean up the system: it stops any snap services, removes mounts, and eventually removes the snap data. At this point our integration test suite fails with this error:
Removing snap lxd and revision 12100
+ [ -d /snap/bin ]
+ find /snap/bin -maxdepth 1 -lname lxd -delete
+ find /snap/bin -maxdepth 1 -lname lxd.* -delete
+ rm -f /snap/bin/lxd
+ rm -f /snap/bin/lxd.benchmark /snap/bin/lxd.buginfo /snap/bin/lxd.check-kernel /snap/bin/lxd.lxc /snap/bin/lxd.migrate
+ umount -d -l /snap/lxd/12100
+ true
+ rm -rf /snap/lxd/12100
+ rm -f /snap/lxd/current
+ rm -rf /var/snap/lxd/12100
+ rm -rf /var/snap/lxd/common
rm: cannot remove '/var/snap/lxd/common/ns/mntns': Device or resource busy
rm: cannot remove '/var/snap/lxd/common/ns/shmounts': Device or resource busy
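The rm failures happen because the two nsfs files are still mount points. As a rough sketch of what a more defensive cleanup could do, the helper below (mounts_under is a hypothetical name, not part of snapd or LXD) lists the mount targets at or below a directory, deepest first, so that children can be unmounted before their parents. Whether unmounting them at this stage is the right fix is a separate question; this only illustrates the ordering.

```shell
# Read mount targets (one per line) on stdin and print those at or below
# directory $1, deepest first, so children come before their parents.
mounts_under() {
    dir=${1%/}
    awk -v d="$dir" '$0 == d || index($0, d "/") == 1' | LC_ALL=C sort -r
}

# Hypothetical usage before removing the data directory:
#   findmnt -rn -o TARGET | mounts_under /var/snap/lxd/common |
#       while read -r target; do umount "$target"; done
#   rm -rf /var/snap/lxd/common
```

Reverse byte-wise sorting is sufficient here because a parent path is always a strict prefix of its children and therefore sorts before them.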
I’ve debugged this a little further. In this scenario, the stop code of the LXD daemon runs without snapd. The service unit stop fails:
Oct 14 10:10:43 ubuntu lxd.daemon[3223]: => LXD is ready
Oct 14 10:10:56 ubuntu systemd[1]: Stopping Service for snap application lxd.daemon...
Oct 14 10:10:56 ubuntu systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=203
Oct 14 10:10:56 ubuntu systemd[1]: Stopped Service for snap application lxd.daemon.
Oct 14 10:10:56 ubuntu systemd[1]: snap.lxd.daemon.service: Unit entered failed state.
Oct 14 10:10:56 ubuntu systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
The cleanup code is not reached, and mounts are left behind, cluttering the host namespace.
Digging further, the service is defined as:
[Service]
ExecStart=/usr/bin/snap run lxd.daemon
SyslogIdentifier=lxd.daemon
Restart=on-failure
WorkingDirectory=/var/snap/lxd/12181
ExecStop=/usr/bin/snap run --command=stop lxd.daemon
ExecReload=/usr/bin/snap run --command=reload lxd.daemon
TimeoutStopSec=600
Type=simple
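By the time this runs in the postrm deb hook, there’s no /usr/bin/snap to execute, since the package files are already gone. That matches the status=203 in the journal: it is systemd’s EXIT_EXEC code, reported when the command itself could not be executed.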
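To check for this condition on an affected system, one could scan a unit file for Exec* lines whose executable no longer exists. This is a minimal sketch; missing_exec_binaries is a made-up helper name, and the unit path in the usage comment is an assumption based on the unit name seen in the journal.

```shell
# Print the executable of every Exec*= line in unit file $1 that is missing
# or not executable. Leading systemd prefixes (-, @, +, !, :) are stripped.
missing_exec_binaries() {
    sed -n 's/^Exec[A-Za-z]*=[-@+!:]*//p' "$1" |
    while read -r bin _rest; do
        [ -x "$bin" ] || printf 'missing: %s\n' "$bin"
    done | sort -u
}

# Hypothetical usage:
#   missing_exec_binaries /etc/systemd/system/snap.lxd.daemon.service
```

If /usr/bin/snap has already been removed, the ExecStart, ExecStop and ExecReload lines of the unit above all point at it, and the helper reports it once.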
This indicates a problem with the snapd packaging and how it affects snap services. I think we should stop and disable the service units in prerm. This is already done by a call to snap-mgmt --purge in the Fedora/Arch/openSUSE packaging.
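As a rough illustration of the prerm idea, the sketch below stops and disables snap service units while /usr/bin/snap is still installed, so that ExecStop can actually run. It is not the actual snapd packaging code: the function name, the SYSTEMCTL override and the assumption that service units live in /etc/systemd/system as snap.*.service are mine.

```shell
# Stop and disable every snap service unit found in directory $1
# (assumed default: /etc/systemd/system). SYSTEMCTL can be overridden,
# for example with "echo", to get a dry run.
stop_snap_services() {
    unit_dir=${1:-/etc/systemd/system}
    for unit in "$unit_dir"/snap.*.service; do
        [ -e "$unit" ] || continue   # the glob matched nothing
        name=${unit##*/}
        # Keep going even if one unit fails to stop or disable.
        ${SYSTEMCTL:-systemctl} stop "$name" || true
        ${SYSTEMCTL:-systemctl} disable "$name" || true
    done
}

# Hypothetical dry run:
#   SYSTEMCTL=echo stop_snap_services /etc/systemd/system
```

Running this before the package files are removed lets each service’s own stop command do its cleanup, which is the step that currently fails with status 203.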