Yes, thanks for pinging. I will try to catch up with @stgraber tomorrow.
I'm happy to chat about it in person, but I suspect I'll sound like a broken record, basically repeating what I said in https://discuss.linuxcontainers.org/t/snapd-cant-remove-old-revisions-when-running-inside-lxd/452
Sounded like @zyga-snapd had a branch which was getting snap-confine to attempt to fix this, though I'm not sure how exactly that would work given that systemd would still be mounting those snaps automatically on boot, quite possibly long before snap-confine itself is called by the first snap starting. Unless there's some clever systemd dependency ordering going on there somehow?
I remember that my original suggestion for this was to have a snap.mount unit which would have systemd itself do the bind-mount and MS_SLAVE remount of /snap. Doing that would have systemd properly order its mount units, guaranteeing that snap.mount is processed before any other directory underneath it.
I don't remember if you can have the systemd unit declare both the bind-mount + MS_SLAVE remount in one go, but if not, this should be achievable by using a post-start action on the unit, to have it perform the remount.
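For anyone wanting to check whether an MS_SLAVE remount actually took effect, here is a minimal Python sketch (an illustration only, not anything taken from snapd) that classifies the propagation of a mount from one line of /proc/self/mountinfo. It relies on the optional fields described in proc(5) (shared:N, master:N, unbindable); the function name and the sample line are made up.

```python
def propagation(mountinfo_line):
    """Classify the propagation of one /proc/self/mountinfo line."""
    fields = mountinfo_line.split()
    # Optional fields sit between the mount options (index 5)
    # and the "-" separator that precedes the filesystem type.
    sep = fields.index("-", 6)
    tags = [field.split(":")[0] for field in fields[6:sep]]
    kinds = []
    if "shared" in tags:
        kinds.append("shared")
    if "master" in tags:
        # A "master:N" tag means this mount is a slave of peer group N.
        kinds.append("slave")
    if "unbindable" in tags:
        kinds.append("unbindable")
    # No optional tags at all means the mount is private.
    return "+".join(kinds) or "private"


# Hypothetical sample line, not copied from a real system.
SAMPLE_LINE = "36 25 0:32 / /snap rw,relatime master:1 - ext4 /dev/sda1 rw"
print(propagation(SAMPLE_LINE))
```

On a real system you would feed it the lines of /proc/self/mountinfo and look at the entry whose fifth field is /snap.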
I made an attempt, but I ran into issues either with systemd or with a security review when trying to work around deficiencies in systemd.
The crux of the limitation was indeed that a /snap mount unit is not enough, as there's no way to apply MS_SLAVE this way. I will try your suggestion to have a post-start action that changes sharing.
One annoying limitation is that FUSE mounts are not reliably represented in /proc/self/mountinfo, so we cannot unmount and remount them to fix something. We must ask systemd to do that, but this is too much power to wield from snap-confine (this is what my earlier branch attempted).
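To see whether a given snap mount point shows up in /proc/self/mountinfo at all, a small Python sketch along these lines could help. It is only an illustration: the function name find_mount and the sample text are hypothetical, and it ignores the octal escaping the kernel applies to spaces in paths.

```python
def find_mount(mountinfo_text, mount_point):
    """Return (fstype, source) for mount_point, or None if it is not listed."""
    found = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # A well-formed line has six fixed fields, a "-" separator,
        # and three more fields after it.
        if len(fields) < 10 or "-" not in fields[6:]:
            continue
        if fields[4] != mount_point:
            continue
        sep = fields.index("-", 6)
        # Later lines win, since a newer mount stacks on top of an older one.
        found = (fields[sep + 1], fields[sep + 2])
    return found


# Hypothetical sample, not copied from a real system.
SAMPLE = (
    "36 25 0:32 / /snap rw,relatime master:1 - ext4 /dev/sda1 rw\n"
    "120 36 0:45 / /snap/nextcloud/5132 ro,nodev,relatime - "
    "fuse.squashfuse squashfuse ro,user_id=0,group_id=0\n"
)
print(find_mount(SAMPLE, "/snap/nextcloud/5132"))
```

Reading the text with open("/proc/self/mountinfo").read() from inside the container and comparing the result with what systemctl reports for the mount unit would show whether the two views disagree.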
Did this approach work?
@kyrofa no, not really; we discussed this with @mvo today and there's another attempt in https://github.com/snapcore/snapd/pull/4517
I'm afraid this may not be fixed, or perhaps there's another problem. I'm using candidate in LXD:
$ snap version
snap 2.31
snapd 2.31
series 16
ubuntu 16.04
kernel 4.4.0-112-generic
Trying to remove a snap I get this:
$ sudo snap remove nextcloud
2018-02-17T18:08:35Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: [stop
snap-nextcloud-5132.mount] failed with exit status 1: Job for snap-nextcloud-5132.mount failed. See
"systemctl status snap-nextcloud-5132.mount" and "journalctl -xe" for details.
Remove snap "nextcloud" (5132) from the system .^C
ubuntu@nextcloud-proxy-test:~$ snap changes
ID Status Spawn Ready Summary
1 Done 2018-02-17T17:40:32Z 2018-02-17T17:40:32Z Initialize system state
2 Done 2018-02-17T17:42:37Z 2018-02-17T17:43:01Z Install "core" snap from "candidate" channel
3 Done 2018-02-17T17:42:37Z 2018-02-17T17:42:40Z Initialize device
4 Done 2018-02-17T17:43:15Z 2018-02-17T17:44:00Z Install "nextcloud" snap
5 Done 2018-02-17T17:47:04Z 2018-02-17T17:47:06Z Change configuration of "nextcloud" snap
6 Doing 2018-02-17T18:07:57Z - Remove "nextcloud" snap
ubuntu@nextcloud-proxy-test:~$ snap change 6
Status Spawn Ready Summary
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:33Z Stop snap "nextcloud" services
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:33Z Run remove hook of "nextcloud" snap if present
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:33Z Remove aliases for snap "nextcloud"
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:34Z Make snap "nextcloud" unavailable to the system
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:34Z Remove security profile for snap "nextcloud" (5132)
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:34Z Remove data for snap "nextcloud" (5132)
Doing 2018-02-17T18:07:57Z - Remove snap "nextcloud" (5132) from the system
Do 2018-02-17T18:07:57Z - Discard interface connections for snap "nextcloud" (5132)
......................................................................
Remove snap "nextcloud" (5132) from the system
2018-02-17T18:08:35Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: [stop snap-nextcloud-5132.mount] failed with exit status 1: Job for snap-nextcloud-5132.mount failed. See "systemctl status snap-nextcloud-5132.mount" and "journalctl -xe" for details.
ubuntu@nextcloud-proxy-test:~$ systemctl status snap-nextcloud-5132.mount
● snap-nextcloud-5132.mount - Mount unit for nextcloud
Loaded: loaded (/proc/self/mountinfo; enabled; vendor preset: enabled)
Active: active (mounted) (Result: exit-code) since Sat 2018-02-17 18:08:35 UTC; 42s ago
Where: /snap/nextcloud/5132
What: squashfuse
Process: 12833 ExecUnmount=/bin/umount /snap/nextcloud/5132 (code=exited, status=32)
Tasks: 1
Memory: 1.0M
CPU: 11.157s
CGroup: /system.slice/snap-nextcloud-5132.mount
└─8363 squashfuse /var/lib/snapd/snaps/nextcloud_5132.snap /snap/nextcloud/5132 -o ro,nodev,
Feb 17 17:43:50 nextcloud-proxy-test systemd[1]: Mounting Mount unit for nextcloud...
Feb 17 17:43:50 nextcloud-proxy-test systemd[1]: Mounted Mount unit for nextcloud.
Feb 17 18:08:35 nextcloud-proxy-test systemd[1]: Unmounting Mount unit for nextcloud...
Feb 17 18:08:35 nextcloud-proxy-test umount[12833]: umount: /snap/nextcloud/5132: not mounted
Feb 17 18:08:35 nextcloud-proxy-test systemd[1]: snap-nextcloud-5132.mount: Mount process exited, code=
Feb 17 18:08:35 nextcloud-proxy-test systemd[1]: Failed unmounting Mount unit for nextcloud.
You will need the 2.31 deb-based package to get the fix. I heard that one is coming out soon, though.
The snapd 2.31.1 debs are in *-proposed; install them from there for testing.
This issue should be fixed everywhere now. Please post comments in case you are affected again.
Thank you for the fix! I'm really happy to be able to use snaps on my servers again (everything runs inside LXD). Thank you @kyrofa for keeping this issue alive. I'm surprised that there are so few snapd+lxd users out there. We need to fix this.
While looking through the PRs I noticed a comment from @stgraber at https://github.com/snapcore/snapd/pull/4560#discussion_r169230231 and I concur: this test will not catch the bug described in this thread. I suggest the test be updated to add a reboot to prevent future regressions.