Test failures with "cannot create lock directory /run/snapd/lock"

Observed this curious faiure on a large number of the spread tests of #3218, which is a relatively innocent looking PR that shouldn’t have caused that:

2017-04-24 15:40:51 Error executing linode:ubuntu-14.04-64:tests/regression/lp-1606277 : 
-----
+ echo 'We can now see a non-empty /var/log directory'
We can now see a non-empty /var/log directory
++ wc -l
++ log-observe-consumer.cmd ls /var/log
cannot create lock directory /run/snapd/lock: Permission denied
+ '[' 0 '!=' 0 ']'
-----

I’ve restarted the build, but can’t think of a reason why this would break.

Any ideas?

Confirmed it was a hiccup, and a pretty strange one at that. Has to be something external, but what would make tests consistently fail in this way?

We found some users reporting this on IRC. This happens when a partial update happens. In that case dpkg --configure -a fixes things. What is technically failing is that new snap-confine is unpacked but the new apparmor profile is not applied yet as this is only done in configure stage.

FYI, there are thousands of these in the error tracker so I think it may be more widespread than partial updates. Eg:

Note that kubelet (the above error) uses ‘confinement: classic’, if that makes a difference. I don’t know if the other errors are all classic confinement.

I doubt we have thousands of machines which all decided to not run the deb configuration properly just now, so there must be something else going on. @maxiberta also reports we have an ever growing number of failures reports in charts about this problem.

@zyga-snapd I’ve heard you’ve been investigating this and has further ideas. Can you please update us on the topic?

@marcoceppi Any idea of why the k8s snaps might be special? Perhaps they are not, actually, and they are simply using the configure script, which other than the core snap still not many snaps have caught up with the feature.

looking at that i see 13000+ errors for kubelet and ~7000 for kube-apiserver … trying to install these locally seems to cause no issues:

ogra@anubis:~$ sudo snap install kube-apiserver
[sudo] Passwort für ogra: 
kube-apiserver 1.6.2 from 'canonical' installed
ogra@anubis:~$ 
ogra@anubis:~$ sudo snap install kubelet --classic
kubelet 1.6.2 from 'canonical' installed
ogra@anubis:~$

@ogra The problem is likely related to the update path, coming from an old snapd, installing these snaps, and then refreshing snapd itself. This might cause the snap-confine apparmor profile to remain old in ancient snapds. Nowadays that profile is supposed to be updated automatically (unless we have a bug obviously).

Actually, that may well have been the failure in the test cases. Note how every single test failed the same way. The core snap was probably refreshed while the suite was running. @zyga-snapd, can we please dig into this first thing next week?

this might be a totally irrelevant obeservation but i’ll mention it anyway …
one thing i see when clicking randomly through all these error reports is that all seem to use snapd 2.25 … we did not release a core snap with 2.25 yet (latest stable core is 1689 which ships with 2.24), so these are all non-re-execed snapd’s

1 Like

That’s a great data point.

Absolutely.

BTW, I observerd that my in-development copy of snapd sends error reports. While I cannot expect the number of errors I saw to match developer workstations and CI I think we might be able to filter those out somehow. I still worry about this bit and I was thinking that we should, as a reliability measure, move the apparmor profile of snap-confine away from /etc (it’s not configuration really, not in the classic sense) and into /usr/share, and have it loaded from there. I’m happy to explore possibilities and discuss ideas when I’m back

Can we get some more data about the problem here? What specific versions are failing (in and out of the sandbox), what snaps and/or operations on these snaps, etc.

From what I can see on a x86_64 xenial install with various ubuntu kernel versions, a classic snap (kubelet rev 27) fails to run its configure script. There’s a range of 4.4 and 4.8 kernels from the Ubuntu archive. On some more inspection I see kube-apiserver, a strict snap, with identical symptoms.

1 Like

I’m not sure what the root cause is but this just bit me today when testing the stable channel of both the charms and snaps.

ii snapd 2.25 amd64 Tool to interact with Ubuntu Core Snappy.
Linux juju-523d1d-2 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

I’ve since reproduced on lxd, I’m about to check if this is related to the confinement of lxd and give it a go on a cloud, but I suspect we are going to find similar results.

Can you still reproduce this? Can you tell me as much as you can about your environment? I’d like to be able to reproduce this and analyze the problem myself.

Certainly. let me start with package versions:

ii  lxd                                    2.14-0ubuntu1~ubuntu16.0 amd64           
snapd                                  2.21 


charles@carbon:~$ snap list
Name        Version       Rev   Developer     Notes
charm       2.2           15    charms        classic
conjure-up  2.1.5         352   canonical     classic
core        16-2          1689  canonical     -
petname     2.7           12    kirkland      -

I started by bootstrapping lxd

juju bootstrap localhost localhost

Next step is to deploy kubernetes core bundle via conjure-up

conjure-up kubernetes-core localhost localhost $(petname)

During the installation phase, both kubernetes-worker and kubernetes-master will fail, upon further investigation it appears that the snap packages for kube-apiserver and kubelet are the culprits of the failure.

Manual installation returns > 0

All i see in the debug log is regarding cgroup denials.

[45167.170493] audit: type=1400 audit(1496685571.760:30238): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=12801 comm="apparmor_parser"
[45183.016805] audit: type=1400 audit(1496685587.604:30239): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.daemon" pid=13517 comm="apparmor_parser"
[45183.097640] audit: type=1400 audit(1496685587.684:30240): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.hook.configure" pid=13519 comm="apparmor_parser"
[45183.187374] audit: type=1400 audit(1496685587.776:30241): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.kube-apiserver" pid=13523 comm="apparmor_parser"
[45183.789064] audit: type=1400 audit(1496685588.376:30242): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=13593 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45183.865330] audit: type=1400 audit(1496685588.452:30243): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=13725 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45184.859105] audit: type=1400 audit(1496685589.448:30244): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine" pid=14027 comm="apparmor_parser"
[45184.874528] audit: type=1400 audit(1496685589.464:30245): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=14027 comm="apparmor_parser"
[45184.879056] audit: type=1400 audit(1496685589.468:30246): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=14047 comm="apparmor_parser"
[45196.169401] audit: type=1400 audit(1496685600.756:30247): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=14473 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45196.249654] audit: type=1400 audit(1496685600.836:30248): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=14492 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45196.618100] audit: type=1400 audit(1496685601.204:30249): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=14561 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0

Note: that when using conjure-up it does apply a special profile to the containers contained in this model.

source: https://github.com/conjure-up/spells/blob/master/kubernetes-core/steps/lxd-profile.yaml

which looks like the following when expanded

charles@carbon:~$ lxc profile show juju-frank-bat
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
  raw.lxc: |
    lxc.aa_profile=unconfined
    lxc.mount.auto=proc:rw sys:rw
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /dev/null
    type: disk
  aadisable1:
    path: /sys/module/apparmor/parameters/enabled
    source: /dev/null
    type: disk
  root:
    path: /
    pool: conjureup
    type: disk
name: juju-frank-bat
used_by:
- /1.0/containers/juju-523d1d-0
- /1.0/containers/juju-523d1d-1
- /1.0/containers/juju-523d1d-2
- /1.0/containers/juju-523d1d-3

It’s worrth noting we are also leveraging the squashfuse package since we’re running snapd in lxd in this setup. this is implied with layer-snap.

Manual attempts to install the snap package return > 0 as mentioned above.

ubuntu@juju-523d1d-2:~$ sudo snap install kube-apiserver --channel=1.6/stable
2017-06-05T18:08:11Z INFO cannot auto connect kube-apiserver:network-bind to core:network-bind: (plug auto-connection), existing connection state "kube-apiserver:network-bind core:network-bind" in the way
error: cannot perform the following tasks:
- Run configure hook of "kube-apiserver" snap if present (run hook "configure": cannot create lock directory /run/snapd/lock: Permission denied)
ubuntu@juju-523d1d-2:~$ echo $?
1

and relevant entries from dmesg

[45524.016400] audit: type=1400 audit(1496685928.577:30290): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine" pid=25977 comm="apparmor_parser"
[45524.042417] audit: type=1400 audit(1496685928.605:30291): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=25977 comm="apparmor_parser"
[45524.047111] audit: type=1400 audit(1496685928.609:30292): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=25979 comm="apparmor_parser"
[45687.503631] audit: type=1400 audit(1496686092.055:30293): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.daemon" pid=29891 comm="apparmor_parser"
[45687.571116] audit: type=1400 audit(1496686092.123:30294): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.hook.configure" pid=29893 comm="apparmor_parser"
[45687.669554] audit: type=1400 audit(1496686092.219:30295): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.kube-apiserver.kube-apiserver" pid=29895 comm="apparmor_parser"
[45688.274890] audit: type=1400 audit(1496686092.823:30296): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=29956 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45688.362630] audit: type=1400 audit(1496686092.911:30297): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=30067 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45688.491876] audit: type=1400 audit(1496686093.043:30298): apparmor="DENIED" operation="mkdir" profile="/usr/lib/snapd/snap-confine" name="/run/snapd/lock/" pid=30072 comm="snap-confine" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[45689.426779] audit: type=1400 audit(1496686093.975:30299): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine" pid=30278 comm="apparmor_parser"
[45689.443265] audit: type=1400 audit(1496686093.995:30300): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/1689/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=30278 comm="apparmor_parser"
[45689.447997] audit: type=1400 audit(1496686093.999:30301): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=30280 comm="apparmor_parser"
1 Like

Woooo! Finally someone can see this problem live! Thanks, @lazypower!

@zyga-snapd Have fun!

1 Like

Hey!

Thank you for providing detailed information. I would appreciate if you could add some more: snap version on the machine just before you start installing things. I’ve also noticed that you start with snapd 2.21, is this old because juju doesn’t apply updates or for some other reason? What is the kernel running all of those? Is lxd coming from some PPA? On 16.04 I see 2.13-0ubuntu3~ubuntu16.04.1.

Also as a small favor, can you please edit your post and use triple backticks for the copy-pasted sections? It is easier on my eyes to read

Ah, the 2.21 is on the host machine, the 2.25 appears to be in the lxd container where those initial commands were executed.

There’s a couple layers here for things to potentially go awry.

According to uname -a i’m running a 4.8.0-53 generic kernel

$ uname -a
Linux carbon 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Additionally, when I probe for the lxd version it appears the version of conjure-up i’m using pulled in a PPA for lxd. So that apt-get upgade after snap installing conjure-up was the culprit of the PPA version thats incoming. conjure-up needs a fairly recent version of lxd.

$ apt-cache policy lxd
lxd:
  Installed: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Candidate: 2.14-0ubuntu1~ubuntu16.04.1~ppa1
  Version table:
 *** 2.14-0ubuntu1~ubuntu16.04.1~ppa1 500
        500 http://ppa.launchpad.net/ubuntu-lxc/lxd-stable/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status
     2.13-0ubuntu3~ubuntu16.04.1 100
        100 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 Packages
     2.0.9-0ubuntu1~16.04.2 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     2.0.2-0ubuntu1~16.04.1 500
        500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
     2.0.0-0ubuntu4 500
        500 http://us.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

My gut feeling says that we’re running into a lxd/snapd bug where the apparmor profile from the outside of the system confines snap-confine inside the container. I’m working on a test case.