I was thinking about adding Before=snapd.mount to our autogenerated mount units, but I was under the impression that systemd must already be doing this (the ordering, I mean). Perhaps something is wrong with the snapd.mount unit, so the fallback code in snap-confine does the remount and we hit the bug then. I didn’t check that.
Just blew another hour babysitting all my containers through updating all their snaps. It actually takes less time to blow away an entire testing container, create a new one, and set it all up for my daily tests, than it takes to update.
/me blows more time babysitting containers through another update…
Any progress here?
I’d be happy to help with this if I understood snap-confine at all. I’m afraid I don’t, so all I can really do is offer to help test. I lose hair in frustration every time an update comes out!
@jdstrand you’re the only other person I know for sure has snap-confine experience. Do you have any insights, here?
I have a fix for this now. I’ll send a PR shortly; I’m checking whether any tests break.
EDIT: I was fighting this all yesterday and it’s still not fixed. I’m trying another strategy. This is one nasty problem.
A quick update: @zyga-snapd put up a proposal that needs to be tweaked:
<zyga-solus> kyrofa: I discussed it with jdstrand, we need to tweak it slightly in order not to make snap-confine too powerful,
<zyga-solus> kyrofa: I’ll get back to it
<zyga-solus> kyrofa: I’m looking at why master breaks so often as this clamps our velocity a lot
<kyrofa> Ah okay, very good. Still a path forward, though?
<zyga-solus> yes, totally
Is anyone still around working on this problem? Or is everyone gone for holidays? I don’t know how to raise the importance of this. I’m seriously considering running stuff from source instead of using my own snaps, and that’s heartbreaking.
I’m happy to chat about it in person, but I suspect I’ll sound like a broken record, basically repeating what I said in https://discuss.linuxcontainers.org/t/snapd-cant-remove-old-revisions-when-running-inside-lxd/452
Sounded like @zyga-snapd had a branch which was getting snap-confine to attempt to fix this, though I’m not sure how exactly that would work given that systemd would still be mounting those snaps automatically on boot, quite possibly much before snap-confine itself is called by the first snap starting. Unless there’s some clever systemd dependency ordering going on there somehow?
I remember that my original suggestion for this was to have a snap.mount unit which would have systemd itself do the bind-mount and MS_SLAVE remount of /snap. Doing that would have systemd properly order its mount units, guaranteeing that snap.mount is processed before any other directory underneath it.
I don’t remember if you can have the systemd unit declare both the bind-mount + MS_SLAVE remount in one go, but if not, this should be achievable by using a post-start action on the unit, to have it perform the remount.
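For anyone who wants to check how /snap is actually propagated inside a container, here is a minimal sketch that reads the optional fields (shared:N, master:N) from mountinfo text. The function names are made up for illustration and are not part of snapd or snap-confine; the sketch only inspects state and never changes a mount.

```python
def parse_mountinfo_line(line):
    """Return (mount_point, propagation_tags, fs_type) for one mountinfo line."""
    fields = line.split()
    # Optional fields sit between the mount options (index 5) and a lone "-".
    sep = fields.index("-", 6)
    return fields[4], fields[6:sep], fields[sep + 1]


def propagation_of(prefix, mountinfo_text):
    """Map each mount point at or under `prefix` to its propagation tags."""
    below = prefix.rstrip("/") + "/"
    result = {}
    for line in mountinfo_text.splitlines():
        if not line.strip():
            continue
        mount_point, tags, _fs_type = parse_mountinfo_line(line)
        if mount_point == prefix or mount_point.startswith(below):
            # No optional fields at all means the mount is private.
            result[mount_point] = tags or ["private"]
    return result


# Hypothetical usage on a live system:
#   with open("/proc/self/mountinfo") as f:
#       for mount_point, tags in sorted(propagation_of("/snap", f.read()).items()):
#           print(mount_point, " ".join(tags))
```

A master:N tag on /snap would indicate the slave propagation discussed above; a bare shared:N tag would not.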
I made an attempt, but I ran into issues either with systemd or with a security review when trying to work around deficiencies in systemd.
The crux of the limitation was indeed that a /snap mount unit is not enough, as there’s no way to apply MS_SLAVE this way. I will try your suggestion to have a post-start action that changes sharing.
One annoying limitation is that FUSE mounts are not reliably represented in /proc/self/mountinfo, so we cannot unmount and remount them to fix something. We must ask systemd to do that, but this is too much power to wield from snap-confine. (This is what my earlier branch attempted.)
Did this approach work?
@kyrofa no, not really; we discussed this with @mvo today and there’s another attempt in https://github.com/snapcore/snapd/pull/4517
I’m afraid this may not be fixed, or perhaps there’s another problem. I’m using candidate in LXD:
$ snap version
snap 2.31
snapd 2.31
series 16
ubuntu 16.04
kernel 4.4.0-112-generic
Trying to remove a snap I get this:
$ sudo snap remove nextcloud
2018-02-17T18:08:35Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: [stop
snap-nextcloud-5132.mount] failed with exit status 1: Job for snap-nextcloud-5132.mount failed. See
"systemctl status snap-nextcloud-5132.mount" and "journalctl -xe" for details.
Remove snap "nextcloud" (5132) from the system .^C
ubuntu@nextcloud-proxy-test:~$ snap changes
ID Status Spawn Ready Summary
1 Done 2018-02-17T17:40:32Z 2018-02-17T17:40:32Z Initialize system state
2 Done 2018-02-17T17:42:37Z 2018-02-17T17:43:01Z Install "core" snap from "candidate" channel
3 Done 2018-02-17T17:42:37Z 2018-02-17T17:42:40Z Initialize device
4 Done 2018-02-17T17:43:15Z 2018-02-17T17:44:00Z Install "nextcloud" snap
5 Done 2018-02-17T17:47:04Z 2018-02-17T17:47:06Z Change configuration of "nextcloud" snap
6 Doing 2018-02-17T18:07:57Z - Remove "nextcloud" snap
ubuntu@nextcloud-proxy-test:~$ snap change 6
Status Spawn Ready Summary
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:33Z Stop snap "nextcloud" services
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:33Z Run remove hook of "nextcloud" snap if present
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:33Z Remove aliases for snap "nextcloud"
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:34Z Make snap "nextcloud" unavailable to the system
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:34Z Remove security profile for snap "nextcloud" (5132)
Done 2018-02-17T18:07:57Z 2018-02-17T18:08:34Z Remove data for snap "nextcloud" (5132)
Doing 2018-02-17T18:07:57Z - Remove snap "nextcloud" (5132) from the system
Do 2018-02-17T18:07:57Z - Discard interface connections for snap "nextcloud" (5132)
......................................................................
Remove snap "nextcloud" (5132) from the system
2018-02-17T18:08:35Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: [stop snap-nextcloud-5132.mount] failed with exit status 1: Job for snap-nextcloud-5132.mount failed. See "systemctl status snap-nextcloud-5132.mount" and "journalctl -xe" for details.
ubuntu@nextcloud-proxy-test:~$ systemctl status snap-nextcloud-5132.mount
● snap-nextcloud-5132.mount - Mount unit for nextcloud
Loaded: loaded (/proc/self/mountinfo; enabled; vendor preset: enabled)
Active: active (mounted) (Result: exit-code) since Sat 2018-02-17 18:08:35 UTC; 42s ago
Where: /snap/nextcloud/5132
What: squashfuse
Process: 12833 ExecUnmount=/bin/umount /snap/nextcloud/5132 (code=exited, status=32)
Tasks: 1
Memory: 1.0M
CPU: 11.157s
CGroup: /system.slice/snap-nextcloud-5132.mount
└─8363 squashfuse /var/lib/snapd/snaps/nextcloud_5132.snap /snap/nextcloud/5132 -o ro,nodev,
Feb 17 17:43:50 nextcloud-proxy-test systemd[1]: Mounting Mount unit for nextcloud...
Feb 17 17:43:50 nextcloud-proxy-test systemd[1]: Mounted Mount unit for nextcloud.
Feb 17 18:08:35 nextcloud-proxy-test systemd[1]: Unmounting Mount unit for nextcloud...
Feb 17 18:08:35 nextcloud-proxy-test umount[12833]: umount: /snap/nextcloud/5132: not mounted
Feb 17 18:08:35 nextcloud-proxy-test systemd[1]: snap-nextcloud-5132.mount: Mount process exited, code=
Feb 17 18:08:35 nextcloud-proxy-test systemd[1]: Failed unmounting Mount unit for nextcloud.
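In the status output above, systemd reports the unit as active (mounted) while umount says "not mounted". One way to look into a mismatch like that is to compare two mountinfo views, for example /proc/1/mountinfo against /proc/self/mountinfo. This is only a rough sketch: the helper names are hypothetical, it assumes the two views can really differ (separate mount namespaces), and reading /proc/1/mountinfo usually needs root.

```python
def mount_points(mountinfo_text):
    """Collect the mount point (fifth field) of every mountinfo line."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) > 4:
            points.add(fields[4])
    return points


def only_in_first(first_text, second_text, prefix="/snap/"):
    """List mount points under `prefix` that appear in the first view only."""
    missing = mount_points(first_text) - mount_points(second_text)
    return sorted(p for p in missing if p.startswith(prefix))


# Hypothetical usage, run as root inside the container:
#   with open("/proc/1/mountinfo") as a, open("/proc/self/mountinfo") as b:
#       print(only_in_first(a.read(), b.read()))
```

An empty list would mean both views agree on the mounts under /snap, in which case the cause lies elsewhere.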
You will need the 2.31 deb-based package to get the fix. I heard that one is coming out soon, though.
The snapd 2.31.1 debs are in *-proposed; install them from there for testing.
This issue should be fixed everywhere now. Please post comments in case you are affected again.
Thank you for the fix! I’m really happy to be able to use snaps on my servers again (everything runs inside LXD). Thank you @kyrofa for keeping this issue alive. I’m surprised that there are so few snapd+lxd users out there. We need to fix this.
While looking through the PRs I noticed a comment from @stgraber at https://github.com/snapcore/snapd/pull/4560#discussion_r169230231 and I concur: this test will not catch the bug described in this thread. I suggest updating the test to include a reboot, to prevent future regressions.