Test failures with "cannot create lock directory /run/snapd/lock"

Interestingly enough, I updated to snapd 2.25 and purged the lxd package in favor of the lxd that ships in the conjure-up snap, and did not encounter this blocking error.

I’m unsure which was the final culprit, whether it was the version of snapd or the lxd version in play, but there is a workaround in sight if anyone else encounters this issue.

I spent some time trying to reproduce this today. I started with a fresh install of 16.04.1 on amd64.

I updated all the packages, including the kernel, and rebooted. I also installed the same snaps as the reporter:

kernel

zyga@ubuntu:~$ uname -a
Linux ubuntu 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

debs

zyga@ubuntu:~$ apt-cache policy snapd lxd 
snapd:
  Installed: 2.25
  Candidate: 2.25
  Version table:
 *** 2.25 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.2 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages
lxd:
  Installed: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Candidate: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Version table:
 *** 2.14-0ubuntu1~ubuntu16.04.1~ppa1 500
        500 http://ppa.launchpad.net/ubuntu-lxc/lxd-stable/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status
     2.13-0ubuntu3~ubuntu16.04.1 100
        100 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages
     2.0.9-0ubuntu1~16.04.2 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     2.0.2-0ubuntu1~16.04.1 500
        500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
     2.0.0-0ubuntu4 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

snaps

zyga@ubuntu:~$ snap list
Name        Version  Rev   Developer  Notes
charm       2.2      15    charms     classic
conjure-up  2.1.5    352   canonical  classic
core        16-2     1689  canonical  -
petname     2.7      12    kirkland   -

Then I ran the conjure-up command as instructed:

zyga@ubuntu:~$ conjure-up kubernetes-core localhost localhost $(petname)

After running for a considerable amount of time (roughly an hour, give or take), it failed because my installation did not have passwordless sudo enabled:

zyga@ubuntu:~$ conjure-up kubernetes-core localhost localhost $(petname)
[info] Summoning kubernetes-core to localhost
[info] Using controller 'localhost'
[info] Creating new model named 'enough-roughy', please wait.
[info] Running pre deployment tasks.
[info] Finished pre deploy task: Successful pre-deploy.
[info] Deploying easyrsa... 
[info] easyrsa: deployed, installing.
[info] Deploying etcd... 
[info] etcd: deployed, installing.
[info] Deploying flannel... 
[info] flannel: deployed, installing.
[info] Deploying kubernetes-master... 
[info] kubernetes-master: deployed, installing.
[info] Deploying kubernetes-worker... 
[info] kubernetes-worker: deployed, installing.
[info] Setting application relations
[info] Completed setting application relations
[info] Waiting for applications to start
[error] Step requires passwordless sudo: Kubernetes Cluster Controller
Exception in worker
Traceback (most recent call last):
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/thread.py", line 66, in _worker
    work_item.run()
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/thread.py", line 59, in run
    self.future.set_result(result)
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/_base.py", line 494, in set_result
    self._invoke_callbacks()
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/_base.py", line 297, in _invoke_callbacks
    callback(self)
  File "/snap/conjure-up/352/lib/python3.6/site-packages/conjureup/controllers/deploystatus/tui.py", line 24, in finish
    return controllers.use('steps').render()
  File "/snap/conjure-up/352/lib/python3.6/site-packages/conjureup/controllers/steps/tui.py", line 34, in render
    sys.exit(1)
SystemExit: 1
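
(For anyone hitting the same gate: below is a minimal sketch of enabling passwordless sudo with a sudoers drop-in; the file name and rule are only an example, adjust to taste.)

echo "$USER ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/99-conjure-up-nopasswd
sudo chmod 0440 /etc/sudoers.d/99-conjure-up-nopasswd
sudo visudo -c   # sanity-check all sudoers files before relying on the change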

Still, I wanted to see whether there were any interesting apparmor denials on the system. The vast majority of the denials were caused by snapctl and socket probing; filtering those out, I see:

zyga@ubuntu:~$ dmesg | grep DENIED | egrep -v 'family="inet6?"'
[   67.773206] audit: type=1400 audit(1496745326.582:50): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/sbin/dhclient" name="/run/systemd/journal/stdout" pid=1989 comm="dhclient" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[   67.773250] audit: type=1400 audit(1496745326.582:51): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/sbin/dhclient" name="/run/systemd/journal/stdout" pid=1989 comm="dhclient" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[  104.495471] audit: type=1400 audit(1496745363.387:52): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/usr/lib/lxd/lxd-bridge-proxy" name="/run/systemd/journal/stdout" pid=2896 comm="lxd-bridge-prox" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[  104.495497] audit: type=1400 audit(1496745363.387:53): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/usr/lib/lxd/lxd-bridge-proxy" name="/run/systemd/journal/stdout" pid=2896 comm="lxd-bridge-prox" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[ 2320.425172] audit: type=1400 audit(1496747584.865:464): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69532 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2321.436216] audit: type=1400 audit(1496747585.874:465): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69564 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2323.011290] audit: type=1400 audit(1496747587.452:466): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69600 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2324.338007] audit: type=1400 audit(1496747588.781:467): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69639 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2326.890994] audit: type=1400 audit(1496747591.335:468): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69825 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2328.696951] audit: type=1400 audit(1496747593.144:469): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69908 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2330.786056] audit: type=1400 audit(1496747595.234:470): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69961 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0

So, interestingly, there are some LXD/LXC journal issues, some DHCP issues, and a small set of kube-apiserver issues related to /run/kubernetes/. I’m mainly reporting this as I’m about to tear down this virtual machine and start again, this time with the package versions from the original report. What this shows is that up-to-date systems are not affected by the lock issue.

So here’s my theory (a quick way to check it is sketched after the list):

  • with the quoted configuration LXD will not be confined with apparmor
  • thus apparmor in the container will not stack on top of the host
  • thus any processes in the container will be confined with whatever is on the host
  • thus snap-confine’s apparmor profile from the host will determine if snap-confine in the container crashes or not (on the lock directory)
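
A quick way to check this on a given machine (just a sketch, not snapd tooling; <container-name> below is a placeholder for the juju/LXD container in question):

# on the host: is LXD itself confined by apparmor?
sudo aa-status | grep -i lxd
# inside the container: what apparmor label does a process there actually run under?
lxc exec <container-name> -- cat /proc/self/attr/current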

I confirmed this with jjohansen as well as with a local setup based on my earlier experiment. EDIT: To reproduce it I did just what @lazypower did: downgrade snapd on the host to 2.21. This is sufficient because then the 2.21 apparmor profile confines snap-confine in the container.
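
For the record, the downgrade itself is just an apt operation, roughly like this (the 2.21 version string is from memory and older versions may no longer be on the mirrors, so check what apt can still see first):

apt-cache policy snapd
sudo apt-get install snapd=2.21
# if that version is gone from the archive, install a previously downloaded .deb directly,
# e.g. sudo dpkg -i snapd_2.21_amd64.deb (file name is illustrative)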

The outcome is:

  • this is not a supported configuration
  • this is not a bug in snap-confine or snapd that we can fix

The question is whether we can be proactive about detecting it and giving users a helpful message that gets them out of the problem.

So, thinking about this briefly, we could do the following (a rough sketch follows the list):

  • read the checksum of the profile applied to snap-confine at runtime
  • have snapd provide the checksum of the profile of snap-confine it expects to use when running snap-confine
  • if a mismatch is found report an informative error (e.g. linking to a forum thread)
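
Very roughly, the check could look like this (a sketch only; it assumes the kernel exposes per-profile checksums under securityfs, which not every kernel does, and the "expected" value is a stand-in for whatever snapd would record when it loads the profile):

#!/bin/sh
expected_sha1="<recorded-by-snapd-at-profile-load-time>"
loaded_sha1="$(cat /sys/kernel/security/apparmor/policy/profiles/*snap-confine*/sha1 2>/dev/null | head -n1)"
if [ -n "$loaded_sha1" ] && [ "$loaded_sha1" != "$expected_sha1" ]; then
    echo "snap-confine is confined by an unexpected apparmor profile," >&2
    echo "see the forum thread on /run/snapd/lock failures for how to refresh it" >&2
fi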

Great investigative work, zyga.

We disable that apparmor profile (an approach initially pioneered by @adam.stokes) in the case of Kubernetes on LXD for three components:

  • Flannel Networking
  • Docker in LXD
  • Kubelet

All of these components misbehave when strictly confined by apparmor. Flannel needs to create tap devices; docker will misbehave in various scenarios depending on the workload requirements, what’s mapped into the container, the filesystem choice, and so on; and Kubelet is the primary driver of worker units in Kubernetes, so it needs fairly permissive restrictions.
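
For context, one common way to run a LXD container without an apparmor profile looks roughly like this on the LXC 2.x stack (not necessarily the exact mechanism conjure-up uses, the container name is a placeholder, and newer LXC renamed the key to lxc.apparmor.profile):

lxc config set my-container raw.lxc "lxc.aa_profile=unconfined"
lxc restart my-container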

More details about our confinement can be gleaned from @cynerva and Rye Terrell, as they wrote the snap packages that determine the confinement requirements.

We can probably revisit this with stricter confinement policies, but that’s not something we have on the roadmap today.

I think your proposal of detecting that the user is running an unsupported configuration is fine, but what’s the end result? Does it fail and block? Will it simply create error spam for the user to self-educate? I’m a little concerned that we are painting ourselves into a corner, without much time to correct the situation, if blocking is the case.

Yes, it will always fail and block. We cannot change that. Even assuming we could “free” ourselves from the profile (which we cannot do) it would also break security of the host itself.

Ok, with respect to this specific issue that makes sense, as it was blocked regardless. I was speaking more in broad strokes about the messaging around an unsupported configuration :slight_smile:

Hi,

Last week snaps on my laptop were working fine. This morning I encountered the error described above. I’ve commented on the launchpad bug. None of the workarounds suggested here or in the bug help me get back to a working system.

Does anyone have any advice on how I can get my snappy system out of this failure mode? Several of my day-to-day development tools are now shipped as snaps (e.g. charm-tools), so this has a pretty large impact on my productivity :frowning2:.

Cheers,

@thomir Can you please confirm that you have a /etc/apparmor.d/usr.lib.snapd.snap-confine.real file on your system that includes these lines:

# support for locking
/run/snapd/lock/ rw,
/run/snapd/lock/*.lock rwk,

If so, please try running this command:

apparmor_parser --replace --write-cache /etc/apparmor.d/usr.lib.snapd.snap-confine.real --cache-loc /etc/apparmor.d/cache

@mvo His system has a newer snapd outside (2.25) and an older one inside (2.24), which means no re-exec. The snap-confine profile from inside would break snapd from the outside because it’s too old. Are we testing to make sure that we don’t copy the profile in those cases? On a quick skim I could not find logic for that.

Hi @niemeyer

I do have those lines included in my /etc/apparmor.d/usr.lib.snapd.snap-confine.real, and running the apparmor_parser command does indeed fix the problem.

Thanks!

@niemeyer We never override the /etc/apparmor.d/usr.lib.snapd.snap-confine.real profile; if we copy/rewrite profiles, they always get unique names like snap.core.2128.usr.lib.snapd.snap-confine
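
To see which snap-confine profiles are actually present and loaded on a system (useful for checking whether a per-revision copy was picked up), something along these lines works:

sudo aa-status | grep snap-confine        # loaded profile names, including per-core-revision copies
ls /etc/apparmor.d/ | grep snap-confine   # on-disk profile sources (location as on my own machines)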

Thanks @thomir for confirming that the apparmor replace command helped. The interesting question is why the profile was not applied at package install time. Fwiw, we had an issue during the sprint last week when apparmor profiles were also not correctly applied. We will investigate this issue further.

@fgimenez reported a bug about this here: https://bugs.launchpad.net/snapd/+bug/1701494. It could be the same underlying issue.

During the sprint last week we found a very similar issue and managed to collect the data of the loaded profile “before” and “after”; we are tracking this in this bug: https://bugs.launchpad.net/snapd/+bug/1701494. Sadly we just realized that the collected data was empty (because of how tar interacts with sysfs), but we’re confident this can be re-tried. I will update this bug again when we have more information.

Hey, are you still affected by this bug? Was it just once, or does it occur repeatedly (e.g. after a reboot)?

Hey @thomir, did you build snapd by hand by any chance? Can you tell me more about your environment?

Hi @zyga-snapd,

I have built snapd from source, but not for a long time, and the version running in my system is not built from source. I’m running zesty, my core snap is version 16-2 (revision 1689), and the host snapd package is version 2.25+17.04.

Let me know what other information you’d like…

Thanks, it would be fantastic if you could still reproduce this.

EDIT: When you first saw this, had you rebooted your computer beforehand? I recall you said it was “in the morning” but I’m not sure if that was an actual cold boot.

Hi @zyga-snapd,

Sadly it’s working as expected now, so I can’t reproduce any more. This wasn’t after a cold boot, I tend to leave my laptop on all the time.

Cheers,

Do you have any system logs from that period? If you still do, could you collect logs from before the event (say, the evening before) all the way to the moment when it broke? Each time apparmor profiles are updated there’s a trace in the system logs; this could help me figure out what the problem is.
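
Something like this would capture the relevant window (the timestamps are only an example, adjust them to when things broke for you):

journalctl --since "2017-07-03 18:00" --until "2017-07-04 12:00" > snapd-apparmor-window.log
grep -i apparmor snapd-apparmor-window.log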