LXC: snaps don't update

critical
zyga

#1

I have three Nextcloud revisions available:

$ ls -l /snap/nextcloud/
total 0
drwxr-xr-x 24 root root 0 Apr 25 17:44 1212
drwxr-xr-x 24 root root 0 Apr 30 08:10 1284
drwxr-xr-x 24 root root 0 May 16 20:27 1337
lrwxrwxrwx  1 root root 4 May 16 21:07 current -> 1337

Note that the oldest is 1212. Once I update, I expect that revision to be gone. This is what happens:

$ sudo snap refresh --candidate nextcloud
2017-05-25T22:30:49Z INFO cannot auto connect nextcloud:network-bind to core:network-bind: (plug auto-connection), existing connection state "nextcloud:network-bind core:network-bind" in the way
2017-05-25T22:30:53Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: [stop snap-nextcloud-1212.mount] failed with exit status 1: Job for snap-nextcloud-1212.mount failed. See "systemctl status snap-nextcloud-1212.mount" and "journalctl -xe" for details.

2017-05-25T22:33:53Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: umount: /snap/nextcloud/1212: not mounted
2017-05-25T22:36:54Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: umount: /snap/nextcloud/1212: not mounted
2017-05-25T22:39:54Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: umount: /snap/nextcloud/1212: not mounted
2017-05-25T22:42:54Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: umount: /snap/nextcloud/1212: not mounted

So it looks like, indeed, it’s gone. But snapd doesn’t expect it to be gone. This goes on forever. Thankfully ctrl+c stops it without causing the entire operation to be canceled: the snap still seems to be updated. But what if this was an automated refresh? Would it just be stuck forever?

$ snap version
snap    2.25
snapd   2.25
series  16
ubuntu  16.04
kernel  4.4.0-78-generic

Snapcraft adt failures with the new core release
Snapd in LXD failing to upgrade to 2.31
#2

It was systemd that reported a failure during the procedure. Is it saying anything in the logs? (see the commands it suggested there)


#3
$ systemctl status snap-nextcloud-1212.mount
â—Ź snap-nextcloud-1212.mount - Mount unit for nextcloud
   Loaded: loaded (/proc/self/mountinfo; enabled; vendor preset: enabled)
   Active: active (mounted) (Result: exit-code) since Thu 2017-05-25 22:30:53 UTC; 17h ago
    Where: /snap/nextcloud/1212
     What: squashfuse
  Process: 2146 ExecUnmount=/bin/umount /snap/nextcloud/1212 (code=exited, status=32)
    Tasks: 1
   Memory: 440.0K
      CPU: 6ms
   CGroup: /system.slice/snap-nextcloud-1212.mount
           └─109 squashfuse /var/lib/snapd/snaps/nextcloud_1212.snap /snap/nextcloud/1212 -o ro,nodev,allow_other,suid

May 25 22:25:35 staging systemd[1]: Mounting Mount unit for nextcloud...
May 25 22:25:35 staging systemd[1]: Mounted Mount unit for nextcloud.
May 25 22:30:53 staging systemd[1]: Unmounting Mount unit for nextcloud...
May 25 22:30:53 staging umount[2146]: umount: /snap/nextcloud/1212: not mounted
May 25 22:30:53 staging systemd[1]: snap-nextcloud-1212.mount: Mount process exited, code=exited status=32
May 25 22:30:53 staging systemd[1]: Failed unmounting Mount unit for nextcloud.

The journal shows me:

May 26 16:11:18 staging /usr/lib/snapd/snapd[367]: taskrunner.go:367: DEBUG: Running task 177 on Doing: Remove snap "nextcloud" (1212) from the system
May 26 16:11:18 staging /usr/lib/snapd/snapd[367]: task.go:303: DEBUG: 2017-05-26T16:11:18Z ERROR cannot remove snap file "nextcloud", will retry in 3 mins: umount: /snap/nextcloud/1212: not mounted

Huh, it’s still going this morning even though I ctrl+c’d it yesterday:

$ snap change 34
Status  Spawn                 Ready                 Summary
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:00Z  Download snap "nextcloud" (1446) from channel "candidate"
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:02Z  Fetch and check assertions for snap "nextcloud" (1446)
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:03Z  Mount snap "nextcloud" (1446)
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:39Z  Stop snap "nextcloud" services
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:40Z  Remove aliases for snap "nextcloud"
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:41Z  Make current revision for snap "nextcloud" unavailable
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:48Z  Copy snap "nextcloud" data
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:51Z  Setup snap "nextcloud" (1446) security profiles
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:51Z  Make snap "nextcloud" (1446) available to the system
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:52Z  Set automatic aliases for snap "nextcloud"
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:52Z  Setup snap "nextcloud" aliases
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:53Z  Start snap "nextcloud" (1446) services
Done    2017-05-25T22:26:31Z  2017-05-25T22:30:53Z  Remove data for snap "nextcloud" (1212)
Doing   2017-05-25T22:26:31Z  -                     Remove snap "nextcloud" (1212) from the system
Do      2017-05-25T22:26:31Z  -                     Clean up "nextcloud" (1446) install
Do      2017-05-25T22:26:31Z  -                     Run configure hook of "nextcloud" snap if present

#4

Try just remounting it to something arbitrary… it will be gone soon.

The fact this is using squashfsfuse is likely related… it probably got unmounted elsewhere.

In either case, we need to make that part of the removal more resilient. There’s no reason to fail unmounting if it’s already unmounted, obviously. Thanks for reporting.


#5

Huh, interesting. I was about to try mounting something in place when I noticed this:

$ mount | grep 1212
squashfuse on /snap/nextcloud/1212 type fuse.squashfuse (ro,nodev,relatime,user_id=0,group_id=0,allow_other)
$ sudo umount /snap/nextcloud/1212
umount: /snap/nextcloud/1212: not mounted

I agree that this smells like squashfuse. It seems like it’s somehow partially mounted, but unmountable.


#6

Similarly odd:

$ fusermount -u /snap/nextcloud/1212 
fusermount: entry for /snap/nextcloud/1212 not found in /etc/mtab
$ grep 1212 /etc/mtab 
squashfuse /snap/nextcloud/1212 fuse.squashfuse ro,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

#7

Okay, that feels bogus indeed. Can you dig further to see what’s going on?


#8

I’m not sure what to make of this:

ubuntu@staging:~$ sudo lsof | grep 1212
squashfus   109             root    3r      REG                9,2 167022592   84282175 /var/lib/snapd/snaps/nextcloud_1212.snap
ubuntu@staging:~$ sudo fuser -vm /snap/nextcloud/1212
                     USER        PID ACCESS COMMAND
/snap/nextcloud/1212:
                     root     kernel mount /
                     root          1 Frce. systemd
                     root         51 .rce. systemd-journal
                     root         57 frce. systemd-udevd
                     root        106 frce. squashfuse
                     root        107 frce. squashfuse
                     root        108 frce. squashfuse
                     root        109 frce. squashfuse
                     root        110 frce. squashfuse
                     root        294 Frce. dhclient
                     syslog      365 .rce. rsyslogd
                     root        367 .rce. snapd
                     root        391 .rce. accounts-daemon
                     daemon      399 .rce. atd
                     root        412 .rce. cron
                     messagebus    413 .rce. dbus-daemon
                     root        415 .rce. systemd-logind
                     root        418 .rce. sshd
                     root        460 .rce. polkitd
                     root        486 .rce. agetty
                     root       1026 .rce. bash
                     root       1040 .rce. su
                     ubuntu     1041 .rce. systemd
                     ubuntu     1042 .rce. (sd-pam
                     ubuntu     1045 .rce. bash
                     root       1473 frce. squashfuse
                     root       1810 .rce. nextcloud-cron
                     root       1861 .rce. delay-on-failur
                     root       1890 .rc.. mdns-publisher
                     root       1911 .rce. run-httpd
                     root       1963 .rce. start_mysql
                     root       2016 .rce. start-php-fpm
                     root       2064 .rce. start-redis-ser
                     root       2116 .rce. renew-certs
                     root       2152 .rc.. redis-server
                     root       2389 .rce. sleep
                     root       2392 Frc.. php-fpm
                     root       2489 .rce. httpd-wrapper
                     root       2503 Frc.. httpd
                     root       2504 Frc.. httpd
                     root       2505 Frc.. httpd
                     root       2506 Frc.. httpd
                     root       2633 Fr.e. mysqld_safe
                     root       2797 Frc.. mysqld
                     root       3468 .rce. bash
                     root       3480 .rce. su
                     ubuntu     3481 .rce. bash
                     root       3537 Frc.. httpd
                     root      15264 .rce. sleep
                     root      15558 .rce. sleep
                     root      15559 .rce. sudo

#9

We unmount the snap filesystem with -lazy (aka MNT_DETACH), so it’ll remain in use by the kernel while there are live processes using it.

What is the parent of these processes? If they were started by the snap itself, they should have been terminated when the snap services stopped. This should also not be a reason for the removal to fail (that’s why we use -lazy in the first place).


#10

Let’s use pid 2016 (PHP) as an example:

$ ps l -p2016
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0  2016     1  20   0   4508   628 -      Ss   ?          0:00 /bin/sh /snap/nextcloud/1446/bin/start-php-fpm

The parent seems to be systemd. But notice that this is actually the one from rev 1446. What is happening here? A reboot doesn’t help, the task just resumes and hangs again. There seems to be no way to get snapd to just move on. If I abort it, will it roll back?


#11

As I said, when I reboot, this just resumes where it left off, in the exact same state:

$ mount | grep 1212
squashfuse on /snap/nextcloud/1212 type fuse.squashfuse (ro,nodev,relatime,user_id=0,group_id=0,allow_other)

However, if I go and move /var/lib/snapd/snaps/nextcloud_1212.snap out of the way, and reboot again, there is nothing in mount. Of course, the change still errors out because it’s still trying to unmount something that isn’t there, but it’s interesting that the .mount file is somehow messing this up in the first place.


#12

Huh… I just moved it back into place, and rebooted again, and now the change finally succeeded. The .snap file is still there though, which is odd.

I still have a container where this is happening though, if anyone has any other ways to poke at it.


#13

I hit this every time the snap updates in lxc, now.


#14

The workaround seems to be to remove/rename the snap itself in /var/lib/snapd/snaps, reboot, and wait for the unmount to be attempted again.


#15

This is making every update on my server non-automatic since I have to SSH in and baby it by helping it unmount stuff in order for an update to actually continue. Is no one else using snaps on LXD?


#16

Finally figured out the root cause of this issue thanks to the LXD folks. It’s a bug in snap-confine: https://bugs.launchpad.net/snapd/+bug/1712930 .


#17

@zyga I see you’ve put the bug reported in progress on the 25th. Can you please update this topic with details of what happened and what the plan of action is when you have a moment?

Since it’s still marked as In Progress, that’s 2.28 material I assume?


#18

It is 2.28.x or 2.29 material as as far as I know mvo forked 2.28 already. Stephane analyzed the bug but I have not, yet, come up with a solution.


#19

I’m exploring possible solutions on a 14.04 machine, I will keep you updated.


#20

Hey @zyga, any update on this?