Test failures with "cannot create lock directory /run/snapd/lock"

I’m not sure what the root cause is but this just bit me today when testing the stable channel of both the charms and snaps.

ii snapd 2.25 amd64 Tool to interact with Ubuntu Core Snappy.
Linux juju-523d1d-2 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

I’ve since reproduced this on lxd. I’m about to check whether it’s related to the confinement of lxd and give it a go on a cloud, but I suspect we’ll find similar results.

Can you still reproduce this? Can you tell me as much as you can about your environment? I’d like to be able to reproduce this and analyze the problem myself.

Certainly. Let me start with package versions:

ii  lxd                                    2.14-0ubuntu1~ubuntu16.0 amd64           
snapd                                  2.21 


charles@carbon:~$ snap list
Name        Version       Rev   Developer     Notes
charm       2.2           15    charms        classic
conjure-up  2.1.5         352   canonical     classic
core        16-2          1689  canonical     -
petname     2.7           12    kirkland      -

I started by bootstrapping lxd

juju bootstrap localhost localhost

The next step is to deploy the kubernetes-core bundle via conjure-up:

conjure-up kubernetes-core localhost localhost $(petname)

During the installation phase, both kubernetes-worker and kubernetes-master fail. Upon further investigation, it appears that the snap packages for kube-apiserver and kubelet are the culprits of the failure.

Manual installation exits with a non-zero status.

All I see in the debug log are these apparmor denials:

[45167.170493] audit: type=1400 audit(1496685571.760:30238): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=12801 comm="apparmor_parser"
[45183.016805] audit: type=1400 audit(1496685587.604:30239): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.daemon" pid=13517 comm="apparmor_parser"
[45183.097640] audit: type=1400 audit(1496685587.684:30240): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.hook.configure" pid=13519 comm="apparmor_parser"
[45183.187374] audit: type=1400 audit(1496685587.776:30241): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.kube-apiserver" pid=13523 comm="apparmor_parser"
[45183.789064] audit: type=1400 audit(1496685588.376:30242): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=13593 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45183.865330] audit: type=1400 audit(1496685588.452:30243): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=13725 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45184.859105] audit: type=1400 audit(1496685589.448:30244): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine" pid=14027 comm="apparmor_parser"
[45184.874528] audit: type=1400 audit(1496685589.464:30245): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=14027 comm="apparmor_parser"
[45184.879056] audit: type=1400 audit(1496685589.468:30246): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=14047 comm="apparmor_parser"
[45196.169401] audit: type=1400 audit(1496685600.756:30247): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=14473 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45196.249654] audit: type=1400 audit(1496685600.836:30248): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=14492 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45196.618100] audit: type=1400 audit(1496685601.204:30249): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=14561 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0

Note that when using conjure-up, a special profile is applied to the containers in this model.

source: https://github.com/conjure-up/spells/blob/master/kubernetes-core/steps/lxd-profile.yaml

which looks like the following when expanded

charles@carbon:~$ lxc profile show juju-frank-bat
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
  raw.lxc: |
    lxc.aa_profile=unconfined
    lxc.mount.auto=proc:rw sys:rw
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /dev/null
    type: disk
  aadisable1:
    path: /sys/module/apparmor/parameters/enabled
    source: /dev/null
    type: disk
  root:
    path: /
    pool: conjureup
    type: disk
name: juju-frank-bat
used_by:
- /1.0/containers/juju-523d1d-0
- /1.0/containers/juju-523d1d-1
- /1.0/containers/juju-523d1d-2
- /1.0/containers/juju-523d1d-3

It’s worth noting we are also leveraging the squashfuse package, since we’re running snapd in lxd in this setup. This is implied with layer-snap.
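
For anyone replicating the container setup, squashfuse went in via apt inside each container; roughly something like this (a sketch; assumes the squashfuse and fuse packages are available from the archives you have enabled):

# Inside the LXD container, before installing any snaps:
sudo apt-get update
sudo apt-get install -y squashfuse fuse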

Manual attempts to install the snap package exit with a non-zero status, as mentioned above.

ubuntu@juju-523d1d-2:~$ sudo snap install kube-apiserver --channel=1.6/stable
2017-06-05T18:08:11Z INFO cannot auto connect kube-apiserver:network-bind to core:network-bind: (plug auto-connection), existing connection state "kube-apiserver:network-bind core:network-bind" in the way
error: cannot perform the following tasks:
- Run configure hook of "kube-apiserver" snap if present (run hook "configure": cannot create lock directory /run/snapd/lock: Permission denied)
ubuntu@juju-523d1d-2:~$ echo $?
1

and relevant entries from dmesg

[45524.016400] audit: type=1400 audit(1496685928.577:30290): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine" pid=25977 comm="apparmor_parser"
[45524.042417] audit: type=1400 audit(1496685928.605:30291): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=25977 comm="apparmor_parser"
[45524.047111] audit: type=1400 audit(1496685928.609:30292): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=25979 comm="apparmor_parser"
[45687.503631] audit: type=1400 audit(1496686092.055:30293): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.daemon" pid=29891 comm="apparmor_parser"
[45687.571116] audit: type=1400 audit(1496686092.123:30294): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.hook.configure" pid=29893 comm="apparmor_parser"
[45687.669554] audit: type=1400 audit(1496686092.219:30295): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.kube-apiserver" pid=29895 comm="apparmor_parser"
[45688.274890] audit: type=1400 audit(1496686092.823:30296): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=29956 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45688.362630] audit: type=1400 audit(1496686092.911:30297): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=30067 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45688.491876] audit: type=1400 audit(1496686093.043:30298): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=30072 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45689.426779] audit: type=1400 audit(1496686093.975:30299): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine" pid=30278 comm="apparmor_parser"
[45689.443265] audit: type=1400 audit(1496686093.995:30300): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=30278 comm="apparmor_parser"
[45689.447997] audit: type=1400 audit(1496686093.999:30301): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=30280 comm="apparmor_parser"

Woooo! Finally someone can see this problem live! Thanks, @lazypower!

@zyga-snapd Have fun!


Hey!

Thank you for providing detailed information. I would appreciate it if you could add some more: the snap version on the machine just before you start installing things. I’ve also noticed that you start with snapd 2.21; is this old because juju doesn’t apply updates, or for some other reason? What kernel is running all of this? Is lxd coming from some PPA? On 16.04 I see 2.13-0ubuntu3~ubuntu16.04.1.
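
For reference, something like this, run inside the container just before installing anything, would capture what I’m after (standard snap/apt commands; exact output format varies a bit between versions):

snap version                 # snap and snapd versions plus series
apt-cache policy snapd lxd   # which debs are installed and where they came from
uname -a                     # kernel in use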

Also, as a small favor, could you please edit your post and use triple backticks for the copy-pasted sections? It’s easier on my eyes to read.

Ah, the 2.21 is on the host machine; the 2.25 appears to be in the lxd container where those initial commands were executed.

There are a couple of layers here where things can potentially go awry.

According to uname -a, I’m running a 4.8.0-53-generic kernel:

$ uname -a
Linux carbon 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Additionally, when I probe for the lxd version, it appears the version of conjure-up I’m using pulled in a PPA for lxd. The apt-get upgrade after snap-installing conjure-up is what brought in the PPA version; conjure-up needs a fairly recent version of lxd.

$ apt-cache policy lxd
lxd:
  Installed: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Candidate: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Version table:
 *** 2.14-0ubuntu1~ubuntu16.04.1~ppa1 500
        500 http://ppa.launchpad.net/ubuntu-lxc/lxd-stable/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status
     2.13-0ubuntu3~ubuntu16.04.1 100
        100 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages
     2.0.9-0ubuntu1~16.04.2 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     2.0.2-0ubuntu1~16.04.1 500
        500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
     2.0.0-0ubuntu4 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

My gut feeling says that we’re running into an lxd/snapd bug where the apparmor profile from the host confines snap-confine inside the container. I’m working on a test case.
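
One way to check that theory is to look at the snap-confine profile on the host for the lock-directory rules; something along these lines (the profile file may be named usr.lib.snapd.snap-confine or usr.lib.snapd.snap-confine.real depending on the snapd version, hence the glob):

# On the host: does the deb-shipped snap-confine profile mention the lock directory at all?
grep -n 'snapd/lock' /etc/apparmor.d/usr.lib.snapd.snap-confine*
# And which snap-confine profiles are actually loaded in the kernel:
sudo aa-status | grep -i snap-confine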

Interestingly enough, I updated to snapd 2.25, purged the lxd package in favor of the lxd that ships in the conjure-up snap, and did not encounter this blocking error.

I’m unsure which was the final culprit, the version of snapd or the lxd version in play, but there is a workaround in sight if anyone else encounters this issue.
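
Roughly, the workaround boiled down to this (a sketch from memory; exact package names may differ, and the bundled-lxd detail is as described above):

sudo apt-get update
sudo apt-get install snapd            # pick up 2.25 from xenial-updates
sudo apt-get purge lxd lxd-client     # drop the PPA lxd; conjure-up then drives the lxd it bundles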

I spent some time trying to reproduce this today. I started with a fresh install of 16.04.1 on amd64.

I updated all the packages, including the kernel, and rebooted. I also installed the same snaps as the reporter:

kernel

zyga@ubuntu:~$ uname -a
Linux ubuntu 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

debs

zyga@ubuntu:~$ apt-cache policy snapd lxd 
snapd:
  Installed: 2.25
  Candidate: 2.25
  Version table:
 *** 2.25 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.2 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages
lxd:
  Installed: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Candidate: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Version table:
 *** 2.14-0ubuntu1~ubuntu16.04.1~ppa1 500
        500 http://ppa.launchpad.net/ubuntu-lxc/lxd-stable/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status
     2.13-0ubuntu3~ubuntu16.04.1 100
        100 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages
     2.0.9-0ubuntu1~16.04.2 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     2.0.2-0ubuntu1~16.04.1 500
        500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
     2.0.0-0ubuntu4 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

snaps

zyga@ubuntu:~$ snap list
Name        Version  Rev   Developer  Notes
charm       2.2      15    charms     classic
conjure-up  2.1.5    352   canonical  classic
core        16-2     1689  canonical  -
petname     2.7      12    kirkland   -

Then I ran the conjure-up command as instructed:

zyga@ubuntu:~$ conjure-up kubernetes-core localhost localhost $(petname)

After running for a considerable amount of time (an hour, give or take) it failed because my installation did not have passwordless sudo enabled:

zyga@ubuntu:~$ conjure-up kubernetes-core localhost localhost $(petname)
[info] Summoning kubernetes-core to localhost
[info] Using controller 'localhost'
[info] Creating new model named 'enough-roughy', please wait.
[info] Running pre deployment tasks.
[info] Finished pre deploy task: Successful pre-deploy.
[info] Deploying easyrsa... 
[info] easyrsa: deployed, installing.
[info] Deploying etcd... 
[info] etcd: deployed, installing.
[info] Deploying flannel... 
[info] flannel: deployed, installing.
[info] Deploying kubernetes-master... 
[info] kubernetes-master: deployed, installing.
[info] Deploying kubernetes-worker... 
[info] kubernetes-worker: deployed, installing.
[info] Setting application relations
[info] Completed setting application relations
[info] Waiting for applications to start
[error] Step requires passwordless sudo: Kubernetes Cluster Controller
Exception in worker
Traceback (most recent call last):
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/thread.py", line 66, in _worker
    work_item.run()
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/thread.py", line 59, in run
    self.future.set_result(result)
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/_base.py", line 494, in set_result
    self._invoke_callbacks()
  File "/snap/conjure-up/352/usr/lib/python3.6/concurrent/futures/_base.py", line 297, in _invoke_callbacks
    callback(self)
  File "/snap/conjure-up/352/lib/python3.6/site-packages/conjureup/controllers/deploystatus/tui.py", line 24, in finish
    return controllers.use('steps').render()
  File "/snap/conjure-up/352/lib/python3.6/site-packages/conjureup/controllers/steps/tui.py", line 34, in render
    sys.exit(1)
SystemExit: 1
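
For completeness, passwordless sudo can be enabled with something like the following (a sketch; the sudoers.d file name is arbitrary and editing via visudo is the safer route):

echo "$USER ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/99-conjure-up-nopasswd
sudo chmod 0440 /etc/sudoers.d/99-conjure-up-nopasswd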

Still, I wanted to see if there are any interesting apparmor denials in the system. The vast majority of the denials were caused by snapctl and socket probing. Filtering those out I see:

zyga@ubuntu:~$ dmesg | grep DENIED | egrep -v 'family="inet6?"'
[   67.773206] audit: type=1400 audit(1496745326.582:50): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/sbin/dhclient" name="/run/systemd/journal/stdout" pid=1989 comm="dhclient" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[   67.773250] audit: type=1400 audit(1496745326.582:51): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/sbin/dhclient" name="/run/systemd/journal/stdout" pid=1989 comm="dhclient" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[  104.495471] audit: type=1400 audit(1496745363.387:52): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/usr/lib/lxd/lxd-bridge-proxy" name="/run/systemd/journal/stdout" pid=2896 comm="lxd-bridge-prox" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[  104.495497] audit: type=1400 audit(1496745363.387:53): apparmor="DENIED" operation="file_inherit" namespace="root//lxd-juju-bb8380-0_<var-lib-lxd>" profile="/usr/lib/lxd/lxd-bridge-proxy" name="/run/systemd/journal/stdout" pid=2896 comm="lxd-bridge-prox" requested_mask="wr" denied_mask="wr" fsuid=165536 ouid=165536
[ 2320.425172] audit: type=1400 audit(1496747584.865:464): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69532 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2321.436216] audit: type=1400 audit(1496747585.874:465): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69564 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2323.011290] audit: type=1400 audit(1496747587.452:466): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69600 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2324.338007] audit: type=1400 audit(1496747588.781:467): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69639 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2326.890994] audit: type=1400 audit(1496747591.335:468): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69825 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2328.696951] audit: type=1400 audit(1496747593.144:469): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69908 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 2330.786056] audit: type=1400 audit(1496747595.234:470): apparmor="DENIED" operation="mkdir" profile="snap.kube-apiserver.daemon" name="/run/kubernetes/" pid=69961 comm="kube-apiserver" requested_mask="c" denied_mask="c" fsuid=0 ouid=0

So, interestingly, there are some LXD/LXC journal issues, some DHCP issues and a small set of kube-apiserver issues related to /run/kubernetes/. I’m mainly reporting this as I’m about to tear down this virtual machine and start again, this time with the desired package versions. What this shows is that up-to-date systems are not affected by the lock issue.

So here’s my theory:

  • with the quoted configuration, the container will not be confined by apparmor
  • thus apparmor in the container will not stack on top of the host
  • thus any processes in the container will be confined by whatever is on the host
  • thus snap-confine’s apparmor profile from the host determines whether snap-confine in the container fails or not (on the lock directory)

I confirmed this with jjohansen as well as with a local setup based on my earlier experiment. EDIT: To reproduce it I did just as @lazypower did: downgrade snapd on the host to 2.21. This is sufficient because the 2.21 apparmor profile then confines snap-confine in the container.
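
For anyone who wants to reproduce it, the host-side downgrade looks roughly like this (a sketch; the exact 2.21 version string depends on what the archive or your local apt cache still offers):

apt-cache policy snapd                                 # see which versions apt can still fetch
sudo apt-get install --allow-downgrades snapd=2.21     # pin the host back to the old profile
# then launch a xenial container and try to install any snap inside it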

The outcome is:

  • this is not a supported configuration
  • this is not a bug in snap-confine or snapd that we can fix

The question is whether we can be proactive about detecting it and give users a helpful message that would get them out of the problem.


So, thinking about this briefly, we could do the following (rough sketch after the list):

  • read the checksum of the profile applied to snap-confine at runtime
  • have snapd provide the checksum of the profile of snap-confine it expects to use when running snap-confine
  • if a mismatch is found report an informative error (e.g. linking to a forum thread)
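
In shell terms the check could look something like this (purely illustrative; the real logic would live in snapd, the paths are my assumptions, and comparing the on-disk profiles stands in for reading the profile actually applied at runtime):

# Profile snapd expects (shipped in the core snap) vs. the profile installed on the host:
expected="$(sha256sum /snap/core/current/etc/apparmor.d/usr.lib.snapd.snap-confine.real | cut -d' ' -f1)"
actual="$(sha256sum /etc/apparmor.d/usr.lib.snapd.snap-confine.real | cut -d' ' -f1)"
if [ "$expected" != "$actual" ]; then
    echo "snap-confine is confined by an unexpected apparmor profile, see the forum for details" >&2
fi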

Great investigative work, zyga.

We disable apparmor confinement with that LXD profile (an approach initially pioneered by @adam.stokes) in the Kubernetes-on-LXD case, for three components in total:

  • Flannel Networking
  • Docker in LXD
  • Kubelet

All of these components misbehave when strictly confined by apparmor. Flannel needs to create tap devices; docker misbehaves in various scenarios depending on the workload requirements, what’s mapped into the container, the filesystem choice, and so on; and Kubelet is the primary driver of worker units in Kubernetes, so it needs fairly permissive confinement.

More details about our confinement can be gleaned from @cynerva and Rye Terrell, as they wrote the snap packages and determined the confinement requirements.

We can probably revisit this with stricter confinement policies, but that’s not something we have on the roadmap today.

I think your proposal of detecting whether the user is running an unsupported configuration is fine, but what’s the end result? Does it fail and block? Will it simply create error spam for the user to self-educate? I’m a little concerned that we are painting ourselves into a corner, without much time to correct the situation, if blocking is the case.

Yes, it will always fail and block. We cannot change that. Even assuming we could “free” ourselves from the profile (which we cannot), doing so would also break the security of the host itself.

OK, with respect to this specific issue that makes sense, as it was blocked regardless. I was speaking more in broad strokes about the messaging around unsupported configurations :slight_smile:

Hi,

Last week snaps on my laptop were working fine. This morning I encountered the error described above. I’ve commented on the Launchpad bug. None of the workarounds suggested here or in the bug get me back to a working system.

Does anyone have any advice on how I can get my snappy system out of this failure mode? Several of my day-to-day development tools are now shipped as snaps (e.g. charm-tools), so this has a pretty large impact on my productivity :frowning2:.

Cheers,

@thomir Can you please confirm that you have a /etc/apparmor.d/usr.lib.snapd.snap-confine.real file on your system that includes the lines:

# support for locking
/run/snapd/lock/ rw,
/run/snapd/lock/*.lock rwk,

If so, please try running this command:

apparmor_parser --replace --write-cache /etc/apparmor.d/usr.lib.snapd.snap-confine.real --cache-loc /etc/apparmor.d/cache
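
Afterwards, something like this should confirm the reloaded profile is in place (aa-status ships with the apparmor package on Ubuntu):

sudo aa-status | grep snap-confine    # the /usr/lib/snapd/snap-confine profile should be listed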

@mvo His system has a newer snapd outside (2.25) and an older one inside (2.24), which means no re-exec. The snap-confine profile from inside would break snapd from the outside because it’s too old. Are we testing to make sure that we don’t copy the profile in those cases? On a quick skim I could not find logic for that.
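
For the record, the inside/outside versions and the re-exec target can be compared with something like this (the core snap’s info file path is my assumption):

# Run both on the host and inside the container:
snap version                                  # packaged snap/snapd versions
cat /snap/core/current/usr/lib/snapd/info     # version of the snapd that re-exec would use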

Hi @niemeyer

I do have those lines included in my /etc/apparmor.d/usr.lib.snapd.snap-confine.real, and running the apparmor_parser command does indeed fix the problem.

Thanks!

@niemeyer We never override the /etc/apparmor.d/usr.lib.snapd.snap-confine.real profile; if we copy or rewrite profiles, they always get unique names like snap.core.2128.usr.lib.snapd.snap-confine.
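
A quick way to see which of those profiles are present on a given machine (assuming they all live under /etc/apparmor.d/, per the naming above):

ls -1 /etc/apparmor.d/ | grep -i snap-confine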

Thanks @thomir for confirming that the apparmor_parser replace command helped. The interesting question is why the profile was not applied at package install time. FWIW, we had an issue during the sprint last week where apparmor profiles were also not correctly applied. We will investigate this further.

@fgimenez reported a bug about this here: https://bugs.launchpad.net/snapd/+bug/1701494. It could be the same underlying issue.