Call for testing of the docker snap

Hi,

The new version of the docker snap defaults to using aufs as its storage driver; however, this is configurable via the daemon config file located at $SNAP_DATA/config/daemon.json. Can you try modifying this file to specify btrfs as the storage driver? I.e. try modifying your daemon.json file to read:

{
  "log-level": "error",
  "storage-driver": "btrfs"
}
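For anyone trying this, the change can be sketched as the shell below. This is only a sketch: it falls back to a temporary directory when $SNAP_DATA is unset so it is safe to run anywhere, and the actual daemon restart is left as a comment since it requires the docker snap to be installed.

```shell
# Write the suggested daemon.json with btrfs as the storage driver.
# SNAP_DATA falls back to a temp dir here so the sketch is safe to run anywhere.
SNAP_DATA="${SNAP_DATA:-$(mktemp -d)}"
mkdir -p "$SNAP_DATA/config"
cat > "$SNAP_DATA/config/daemon.json" <<'EOF'
{
  "log-level": "error",
  "storage-driver": "btrfs"
}
EOF
# Confirm the key was written as expected.
grep '"storage-driver"' "$SNAP_DATA/config/daemon.json"
# On the real system, restart the daemon so it picks up the new config:
#   sudo snap restart docker.dockerd
```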

Once I refreshed into the candidate channel, I changed the storage-driver value in /var/snap/docker/current/config/daemon.json from aufs to btrfs, ran `snap restart docker.dockerd`, and it’s all working perfectly. Thanks for your help!


I’m getting stuck on containers that try to bind-mount host paths from places like /etc and /var. In the first instance I was able to create the directory myself before running the container, but several minutes later in the script I hit a second, identical error at a different path.

The error looks like this: `Error response from daemon: error while creating mount source path '/var/lib/etcd': mkdir /var/lib/etcd: permission denied`. This happens while running the privileged rancher-agent container.

Hi,

So what you are trying to do is mount paths from the host file system, such as /etc or /var, into the container? Unfortunately this goes against snap design and security confinement, so these accesses will be denied by AppArmor. If you can provide a more complete explanation of what you’re trying to do, there may be a way to accomplish the same thing without bind-mounting arbitrary host paths.
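As a rough illustration of why such mounts fail: the real policy is enforced by AppArmor, but the effect is roughly that only a few host locations are reachable. The helper and its prefix list below are hypothetical, for demonstration only, and are not the actual AppArmor profile.

```shell
# Hypothetical helper for illustration only: the real enforcement is done by
# AppArmor, and this prefix list is an assumption, not the actual profile.
allowed_prefix() {
  case "$1" in
    /var/snap/docker/*|/home/*|/media/*) echo allowed ;;
    *) echo denied ;;
  esac
}
allowed_prefix /var/lib/etcd                    # prints: denied
allowed_prefix /var/snap/docker/common/volume1  # prints: allowed
```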

Hey, thanks for your reply.

This is part of the deployment process for a Rancher-managed Kubernetes deployment: the rancher-agent Docker container creates bind mounts at /etc/kubernetes and /var/lib/etcd. I manually created the first path and that worked, but creating the second path myself unfortunately didn’t help. I will chat with the Rancher folks to get their side of the story and have a look myself at exactly what the container is up to, but I’d be interested to hear any suggestions you may have. Thanks!

Hey, thanks for taking on updating the docker snap!

We are trying to deploy a docker stack from a swarm manager to a node running 18.06/stable on core16. We’ve been able to deploy previously on 17.09/candidate, but now we’re getting the following error on deploy:

msg="fatal task error" error="mkdir /var/lib/docker: read-only file system" module=node/agent/taskmanager
msg="Peer operation failed:Unable to find the peerDB for nid:2odz3zmrpyxb6hqkz7nnahi8m op:&{3 2odz3zmrpyxb6hqkz7nnahi8m  [] [] [] [] false false false func1}"
msg="state changed" module=node/agent/taskmanager state.desired=RUNNING state.transition="PREPARING->REJECTED"

This is surprising because we can confirm that the volumes, networks, containers, etc. have been deployed to the /var/snap/docker/common/var-lib-docker location.

Can you advise?


I’m seeing another error thrown frequently in our syslog on a worker node connected to a swarm manager:

Oct 16 12:39:16 myhost docker.dockerd[8211]: time="2018-10-16T12:39:16.762233229Z" level=warning msg="failed to retrieve docker-runc version: unknown output format: runc version 1.0.0-rc4+dev
Oct 16 12:39:16 myhost docker.dockerd[8211]: spec: 1.0.0
Oct 16 12:39:16 myhost docker.dockerd[8211]: "

Are others seeing this? I’m also seeing it when I run 17.09/candidate, but my swarm worker’s workload does run and function properly (I still am unable to get it working on 18.06/stable).

Hi,

Unfortunately I haven’t seen these errors before, but can you share some more information, specifically:

snap info core
cat /var/snap/docker/current/config/daemon.json

I would suggest turning on debug output from dockerd by changing the log-level key in the $SNAP_DATA/config/daemon.json file to "debug", restarting dockerd, and then sending me the full dockerd logs if/when the problem occurs again. If the logs are large, you can use something like pastebin.ubuntu.com or send them to me in a direct message on the forum.
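Those steps can be sketched as shell. This sketch operates on a scratch copy of daemon.json rather than the live config so it can be tried safely; on a real system the file to edit is /var/snap/docker/current/config/daemon.json, and the restart/log commands are left as comments.

```shell
# Work on a scratch copy of daemon.json; on a real system you would edit
# /var/snap/docker/current/config/daemon.json instead.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
{
  "log-level": "error",
  "storage-driver": "overlay2"
}
EOF
# Flip log-level to debug. A JSON-aware tool such as jq would be more robust;
# sed keeps the sketch dependency-free.
sed -i 's/"log-level": *"[^"]*"/"log-level": "debug"/' "$CONF"
grep '"log-level"' "$CONF"   # prints:     "log-level":        "debug",  (whitespace preserved)
# Then restart the daemon and collect its logs:
#   sudo snap restart docker.dockerd
#   journalctl -u snap.docker.dockerd -e --no-pager
```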

Lastly, if you could provide a reproducer that would be helpful, as there are many ways to “deploy” and I’m not sure exactly which docker commands you are running.

Hey Ian, thanks for the response. Wasn’t quite sure what you needed; I appreciate you clarifying.

The below describes a swarm setup with a manager running 18.06 on Amazon Linux 2 and two workers, one running 17.09/candidate and the other 18.06/stable, each on the latest version of Ubuntu Core.

Both workers connect to the swarm successfully (although some errors are present in syslog on 18.06/stable). After connecting the workers to swarm I deploy a stack on the manager, which is a simple nginx container from the official image. It deploys successfully to 17.09/candidate and fails on 18.06/stable.

On the 18.06 box:

> snap info core
name:      core
summary:   snapd runtime environment
publisher: Canonical✓
contact:   snaps@canonical.com
license:   unset
description: |
  The core runtime environment for snapd
type:         core
snap-id:      99T7MUlRhtI3U0QFgl5mXXESAiSwt776
tracking:     stable
refresh-date: 14 days ago, at 15:22 UTC
channels:
  stable:    16-2.35.2                   (5548) 92MB -
  candidate: 16-2.35.4                   (5662) 92MB -
  beta:      16-2.35.5                   (5742) 92MB -
  edge:      16-2.36~pre2+git959.a006992 (5731) 92MB -
installed:   16-2.35.2                   (5548) 92MB core

and

> cat /var/snap/docker/current/config/daemon.json
{
    "log-level":        "debug",
    "storage-driver":   "overlay2",
    "experimental":     true,
    "labels":           ["hostname=myhost"],
    "metrics-addr":     "127.0.0.1:9323"
}

again on 18.06 core system:

> sudo snap start docker
syslog output (note the apparmor error): https://pastebin.ubuntu.com/p/spDMrqXY9R/

and on 18.06 core system (this is a necessary step for us because of our VPN setup):

> sudo docker network create \
--subnet 10.11.0.0/16 \
--opt com.docker.network.bridge.name=docker_gwbridge \
--opt com.docker.network.bridge.enable_icc=false \
--opt com.docker.network.bridge.enable_ip_masquerade=true \
docker_gwbridge
output: https://pastebin.ubuntu.com/p/tvK892k6sv/

and finally on 18.06 core system:

> sudo docker swarm join --token SWMTKN-redacted 10.100.0.1:2377
This node joined a swarm as a worker.
syslog output: https://pastebin.ubuntu.com/p/BHXytfQxfJ/

now on the swarm manager:

> docker stack deploy -c compose.nginx.yml

Contents of compose.nginx.yml:

version: "3.5"

services:

  web:
    image: nginx
    ports:
     - "8080:80"
    environment:
     - NGINX_HOST=foobar.com
     - NGINX_PORT=80
    command: [nginx, '-g', 'daemon off;']

on 17.09/candidate this deploys successfully!

on the 18.06/stable core machine, this is the syslog output:

https://pastebin.ubuntu.com/p/98MGPB2XfV/

I haven’t had a chance to reproduce this yet, but a couple of things to point out:

  1. The AppArmor messages about /bin/kmod can safely be ignored; they are expected. In the dockerd wrapper we have to make some elaborate attempts to get the kernel to load storage-driver kernel modules for us, and an unfortunate side effect is that dockerd still tries to call /bin/kmod itself, which AppArmor denies. Since this happens consistently and is normal, I will probably change that rule so the call is still denied but no longer logged.
  2. I see you are using the overlay2 storage driver. Have you tried aufs or overlay? I have had trouble with overlay2 before, and the default in the snap is aufs.
  3. The dockerd warning about the unknown runc version is, I’m pretty sure, harmless and can be ignored.
  4. As you originally mentioned, the message about /var/lib/docker being a read-only file system is confusing. I have looked everywhere in the dockerd source code for /var/lib/docker, and the only place it appears in actual code is as the default value of the --data-root command line option, which in the snap we explicitly set to $SNAP_COMMON/var-lib-docker. Is it possible that somewhere in your scripts you have that value set? It’s also possible the value is coming from somewhere external to dockerd, but I’m still investigating that.
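One quick way to check point 4 is to confirm which data root the running daemon was actually given. On a live system you could run `sudo docker info --format '{{ .DockerRootDir }}'` or inspect the live process with `ps -o args= -C dockerd`; the sketch below uses a simulated command line (an assumption about what the snap's wrapper passes) so the extraction step itself can be demonstrated anywhere.

```shell
# Simulated dockerd command line; on a real system get the live one with:
#   ps -o args= -C dockerd
cmdline='dockerd --data-root /var/snap/docker/common/var-lib-docker --config-file /var/snap/docker/current/config/daemon.json'
# Pull out the --data-root value ("--" stops grep from treating the pattern as options).
data_root=$(printf '%s\n' "$cmdline" | grep -o -- '--data-root [^ ]*' | cut -d ' ' -f 2)
echo "$data_root"   # prints: /var/snap/docker/common/var-lib-docker
```

If this prints /var/lib/docker instead, something other than the snap's wrapper is supplying the data root.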

I have found that trying to deploy Traefik via docker reproduces the error="mkdir /var/lib/docker: read-only file system" issue 100% of the time for me.

  1. Ok!
  2. I haven’t tried overlay; I will, and will follow up. aufs doesn’t work. We opted for overlay2 over aufs because the official Docker documentation lists a known kernel-crash issue for aufs.
  3. Ok!
  4. In the reproducer there are no additional scripts, just the official nginx image and no build steps, so it seems reasonable to guess that the value is coming from something external to dockerd, likely something related to swarm:
    version: "3.5"

    services:

      web:
        image: nginx
        ports:
         - "8080:80"
        environment:
         - NGINX_HOST=foobar.com
         - NGINX_PORT=80
        command: [nginx, '-g', 'daemon off;']

Hey @ijohnson any new insight?

The same issue presents with the overlay driver.

So what is your solution? The snap design complicates things a lot, especially for Docker, if we cannot mount paths from the host.

I hope we will not have to bind-mount everything under /media or /home…

Hey @ijohnson have you made any progress?

No, I have not yet been able to determine the cause of this issue or any workaround.

The snap doesn’t work at all for me when using the overlay2 storage driver. I’m a beginner at Docker, but it should just work even when installed as a snap. I have to use overlay2 since my filesystem is btrfs (aufs isn’t supported on btrfs, and Docker’s docs say only overlay2 or aufs is supported on Ubuntu).
sudo snap install docker

/var/snap/docker/current/config/daemon.json:
{
    "log-level":        "debug",
    "storage-driver":   "overlay2"
}

sudo systemctl stop snap.docker.dockerd
sudo systemctl start snap.docker.dockerd

$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Already exists
Digest: sha256:2557e3c07ed1e38f26e389462d03ed943586f744621577a99efb77324b0fe535
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"permission denied\\\"\"": unknown.
ERRO[0004] error waiting for container: context canceled

Hi,

Do you see any apparmor denials when you run into this problem? I.e. what does the following show:

journalctl --no-pager -e -k | grep apparmor | grep -v kmod | grep snap.docker.dockerd

Hi,

I followed the same steps I listed above to reproduce the problem:

$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete 
Digest: sha256:2557e3c07ed1e38f26e389462d03ed943586f744621577a99efb77324b0fe535
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"permission denied\\\"\"": unknown.
ERRO[0004] error waiting for container: context canceled 

The Apparmor syslog messages:

$ journalctl --no-pager -e -k | grep apparmor | grep -v kmod | grep snap.docker.dockerd
Jan 15 23:25:22 localhost kernel: audit: type=1400 audit(1547591122.578:23266): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.docker.dockerd" pid=24945 comm="apparmor_parser"
Jan 15 23:25:28 localhost kernel: audit: type=1400 audit(1547591128.486:23288): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.docker.dockerd" pid=25250 comm="apparmor_parser"
Jan 15 23:25:34 localhost kernel: audit: type=1400 audit(1547591134.559:23317): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.docker.dockerd" pid=25511 comm="apparmor_parser"
Jan 15 23:25:43 localhost kernel: audit: type=1400 audit(1547591143.047:23361): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="snap.docker.dockerd" name="docker-default" pid=25939 comm="apparmor_parser"
Jan 15 23:25:43 localhost kernel: audit: type=1400 audit(1547591143.415:23364): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="snap.docker.dockerd" name="docker-default" pid=26037 comm="apparmor_parser"
Jan 15 23:25:43 localhost kernel: audit: type=1400 audit(1547591143.887:23367): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="snap.docker.dockerd" name="docker-default" pid=26124 comm="apparmor_parser"
Jan 15 23:33:29 localhost kernel: audit: type=1400 audit(1547591609.053:23377): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="snap.docker.dockerd" name="docker-default" pid=945 comm="apparmor_parser"
Jan 15 23:33:41 localhost kernel: audit: type=1400 audit(1547591621.721:23387): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="snap.docker.dockerd" name="docker-default" pid=1275 comm="apparmor_parser"
Jan 15 23:33:42 localhost kernel: audit: type=1400 audit(1547591622.329:23392): apparmor="DENIED" operation="open" profile="snap.docker.dockerd" name="/@/var/snap/docker/common/var-lib-docker/overlay2/cdf26482d3545a13f95e18e82f13385a824d6cb0cfd789b95f9db1525f7c5108/diff/" pid=1307 comm="runc:[2:INIT]" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

Thanks for the output. The last denial seems to be the culprit here. It’s very odd that the path docker attempts to access is name="/@/var/snap/docker/common/var-lib-docker/overlay2/cdf26482d3545a13f95e18e82f13385a824d6cb0cfd789b95f9db1525f7c5108/diff/".
The file path it should be using, and the one it has access to, is "/var/snap/docker/common/*". I’m not sure why that leading "/@" is there, but that’s why it got denied.

I’ll have to look into this more, but thanks for the info. I will post back if I’m able to figure out a fix for this or a workaround for you.
