Call for testing of the docker snap

ijohnson · October 5, 2018, 9:33pm

The docker snap has a new maintainer and a new home at github.com/docker-snap/docker-snap. Please direct inquiries or issues there going forward. Thanks!

Original post:
Hi all,

I’ve uploaded version 18.06 of the docker snap to the latest/candidate channel and would like to promote it to stable shortly. It can be installed on all platforms with:

sudo snap install docker --candidate

If you run into any problems with this revision, please comment here.
A short list of updates to the snap:

the version of docker has been updated to 18.06-ce
git is now included in the snap so that you can build from git URLs directly
the daemon config file is specified directly when launching, so any changes to this config file (located at $SNAP_DATA/config/daemon.json) are picked up when the daemon starts
zfs-tools is now included in the snap; this should make it possible to use the zfs storage driver, but it hasn’t been tested extensively and is not guaranteed to work
docker.compose is now aliased as docker-compose and docker-machine should be treated similarly soon (see this post)
when using sudo docker, you can access files outside of root’s home directory
the docker command now auto-completes correctly
the default log-level is now error (instead of debug), but this can be changed easily using the above-mentioned config file
the default storage-driver is aufs; this can also be changed using the above-mentioned config file

Thanks,
Ian

andy · October 6, 2018, 6:21pm

While trying to run a basic docker.compose up I get an error:

ERROR: Couldn't connect to Docker daemon - you might need to run `docker-machine start default`.

So checking the logs with snap logs docker.dockerd, I get:

2018-10-06T18:13:31Z docker.dockerd[32648]: time="2018-10-06T11:13:31.550863810-07:00" level=debug msg="Cleaning up old mountid : start."
2018-10-06T18:13:31Z docker.dockerd[32648]: Error starting daemon: error initializing graphdriver: backing file system is unsupported for this graph driver
2018-10-06T18:13:31Z systemd[1]: snap.docker.dockerd.service: Main process exited, code=exited, status=1/FAILURE
2018-10-06T18:13:31Z systemd[1]: snap.docker.dockerd.service: Failed with result 'exit-code'.
2018-10-06T18:13:31Z systemd[1]: snap.docker.dockerd.service: Service hold-off time over, scheduling restart.
2018-10-06T18:13:31Z systemd[1]: snap.docker.dockerd.service: Scheduled restart job, restart counter is at 5.
2018-10-06T18:13:31Z systemd[1]: Stopped Service for snap application docker.dockerd.
2018-10-06T18:13:31Z systemd[1]: snap.docker.dockerd.service: Start request repeated too quickly.
2018-10-06T18:13:31Z systemd[1]: snap.docker.dockerd.service: Failed with result 'exit-code'.
2018-10-06T18:13:31Z systemd[1]: Failed to start Service for snap application docker.dockerd.

I’m using BTRFS, and the stable channel 17.06.2-ce (179) works happily with the BTRFS storage driver. Reading further back in the logs with journalctl -u snap.docker.dockerd.service, the problem comes from docker trying to force aufs over my BTRFS filesystem and failing. I’ll see if I can figure out how to change that in a config file.

EDIT: Looking at the differences in the /etc/ of each snap, there’s a lot more stuff in (321) than (179), so I’m not sure where to begin. The /etc/default/aufs files are identical according to diff.

ijohnson · October 8, 2018, 8:26pm

Hi,

The new version of the docker snap defaults to using aufs as its storage driver. However, this is configurable using the daemon config file located at $SNAP_DATA/config/daemon.json. Can you try modifying this file to specify btrfs as the storage-driver? I.e. try modifying your daemon.json file to show this:

{
  "log-level": "error",
  "storage-driver": "btrfs"
}
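
For anyone who would rather script this than edit the file by hand, here is a minimal sketch. The helper name write_daemon_json is made up for illustration and is not part of the snap. It overwrites the target file with only the two keys shown above, so it is only suitable when daemon.json holds no other settings. The demo writes to a scratch directory; on a real system the target would be the daemon.json under $SNAP_DATA/config (written as root), and dockerd needs a restart afterwards to pick up the change.

```shell
#!/bin/sh
# Hypothetical helper (not part of the docker snap): write a minimal
# daemon.json that selects a storage driver.
# WARNING: overwrites any existing file at the given path.
# Usage: write_daemon_json <path-to-daemon.json> <storage-driver>
write_daemon_json() {
    config_path="$1"
    driver="$2"
    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$config_path")"
    cat > "$config_path" <<EOF
{
  "log-level": "error",
  "storage-driver": "$driver"
}
EOF
}

# Demo against a scratch directory so nothing on the system is touched.
demo_dir="$(mktemp -d)"
write_daemon_json "$demo_dir/config/daemon.json" btrfs
cat "$demo_dir/config/daemon.json"
```

Rewriting the whole file keeps the sketch short; if the file already carries other keys, editing only the storage-driver line is the safer route.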

andy · October 9, 2018, 4:02am

Once I refreshed into the candidate channel, I changed the value for storage-driver in /var/snap/docker/current/config/daemon.json from aufs to btrfs, ran snap restart docker.dockerd, and it’s all working perfectly. Thanks for your help!

seffyroff · October 12, 2018, 7:55am

I’m snagging on running containers that try to bind mount host paths from places like /etc/ and /var. In the first instance I was able to make the dir myself prior to running the container but it’s several minutes later into the script that I get the second identical error at a different path.

The error looks like this: "Error response from daemon: error while creating mount source path '/var/lib/etcd': mkdir /var/lib/etcd: permission denied". This happens when running a privileged rancher-agent container.

ijohnson · October 12, 2018, 11:50am

Hi,

So what you are trying to do is mount paths from the host file system such as /etc or /var into the container? Unfortunately this goes against snap design and security policy, so these mounts will be denied by AppArmor. If you can provide a more complete explanation of what you’re trying to do, there may be a way to do the same thing without bind mounting arbitrary host paths.

seffyroff · October 12, 2018, 4:20pm

Hey, thanks for your reply.

This is part of the deployment process for a Rancher-managed Kubernetes deployment - the rancher agent Docker container creates bind mounts at /etc/kubernetes and /var/lib/etcd. I’ve manually created the first path and that worked, but it’s not loving me creating the second path myself unfortunately. I will chat with the Rancher folks to get their side of the story, and have a look myself at exactly what the container is up to, but I’d be interested to hear any suggestions you may have, thanks!

cailen · October 15, 2018, 10:45pm

Hey, thanks for taking on updating the docker snap!

We are trying to deploy a docker stack from a swarm manager to a node running 18.06/stable on core16. We’ve been able to deploy previously on 17.09/candidate, but now we’re getting the following error on deploy:

msg="fatal task error" error="mkdir /var/lib/docker: read-only file system" module=node/agent/taskmanager
msg="Peer operation failed:Unable to find the peerDB for nid:2odz3zmrpyxb6hqkz7nnahi8m op:&{3 2odz3zmrpyxb6hqkz7nnahi8m  [] [] [] [] false false false func1}"
msg="state changed" module=node/agent/taskmanager state.desired=RUNNING state.transition="PREPARING->REJECTED"

This is surprising because we can confirm that the volumes, networks, containers, et al. have been deployed to the snap/docker/common/var-lib-docker location.

Can you advise?

cailen · October 16, 2018, 12:52pm

I’m seeing another error thrown frequently in our syslog on a worker node connected to a swarm manager:

Oct 16 12:39:16 myhost docker.dockerd[8211]: time="2018-10-16T12:39:16.762233229Z" level=warning msg="failed to retrieve docker-runc version: unknown output format: runc version 1.0.0-rc4+dev
Oct 16 12:39:16 myhost docker.dockerd[8211]: spec: 1.0.0
Oct 16 12:39:16 myhost docker.dockerd[8211]: "

Are others seeing this? I’m also seeing it when I run 17.09/candidate, but my swarm worker’s workload does run and function properly (I still am unable to get it working on 18.06/stable).

ijohnson · October 16, 2018, 4:50pm

Hi,

Unfortunately I haven’t seen these errors before, but can you share some more information, specifically:

snap info core

cat /var/snap/docker/current/config/daemon.json

I would suggest turning on debugging output from dockerd by setting the log-level key in the $SNAP_DATA/config/daemon.json file to "debug", restarting dockerd, and then sending me the full dockerd logs if/when the problem occurs again. If these logs are large, you can use something like pastebin.ubuntu.com or send them to me over direct message on the forum.
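
A minimal sketch of the log-level change described above, in case scripting it is easier than editing by hand. The function name set_log_level is hypothetical, and the sketch assumes the "log-level" key already exists on a single line of the file, as in the daemon.json examples in this thread. It leaves every other key untouched. The demo runs against a temporary file.

```shell
#!/bin/sh
# Hypothetical helper: change the value of "log-level" in a daemon.json file.
# Assumes the key is already present on one line; other keys are left as-is.
# Usage: set_log_level <path-to-daemon.json> <level>
set_log_level() {
    config_path="$1"
    level="$2"
    tmp_file="$(mktemp)"
    # Replace only the quoted value that follows the "log-level" key.
    sed "s/\(\"log-level\"[[:space:]]*:[[:space:]]*\)\"[^\"]*\"/\1\"$level\"/" \
        "$config_path" > "$tmp_file" && cat "$tmp_file" > "$config_path"
    rm -f "$tmp_file"
}

# Demo on a scratch copy rather than the real config file.
demo_file="$(mktemp)"
printf '{\n  "log-level": "error",\n  "storage-driver": "aufs"\n}\n' > "$demo_file"
set_log_level "$demo_file" debug
cat "$demo_file"
```

On the affected machine this would be run as root against /var/snap/docker/current/config/daemon.json, followed by sudo snap restart docker.dockerd. The output can then be read with snap logs docker.dockerd or journalctl -u snap.docker.dockerd.service.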

Lastly, if you could provide a reproducer that would be helpful, as there are many ways to “deploy”, and so I’m not sure what docker commands you are running exactly.

cailen · October 17, 2018, 2:35pm

Hey Ian, thanks for the response. Wasn’t quite sure what you need, appreciate you clarifying.

The below describes a swarm setup with a manager running 18.06 on amazon linux 2, and two workers, one running 17.09/candidate and the other 18.06/stable, each on the latest version of ubuntu core.

Both workers connect to the swarm successfully (although some errors are present in syslog on 18.06/stable). After connecting the workers to swarm I deploy a stack on the manager, which is a simple nginx container from the official image. It deploys successfully to 17.09/candidate and fails on 18.06/stable.

On the 18.06 box:

> snap info core
ndsi.ubuntu.admin@nsentinel-dennis:~$ snap info core
name:      core
summary:   snapd runtime environment
publisher: Canonical✓
contact:   snaps@canonical.com
license:   unset
description: |
  The core runtime environment for snapd
type:         core
snap-id:      99T7MUlRhtI3U0QFgl5mXXESAiSwt776
tracking:     stable
refresh-date: 14 days ago, at 15:22 UTC
channels:
  stable:    16-2.35.2                   (5548) 92MB -
  candidate: 16-2.35.4                   (5662) 92MB -
  beta:      16-2.35.5                   (5742) 92MB -
  edge:      16-2.36~pre2+git959.a006992 (5731) 92MB -
installed:   16-2.35.2                   (5548) 92MB core

and

> cat /var/snap/docker/current/config/daemon.json
{
    "log-level":        "debug",
    "storage-driver":   "overlay2",
    "experimental":     true,
    "labels":           ["hostname=myhost"],
    "metrics-addr":     "127.0.0.1:9323"
}

again on 18.06 core system:

> sudo snap start docker
syslog output (note the AppArmor error): https://pastebin.ubuntu.com/p/spDMrqXY9R/

and on 18.06 core system (this is a necessary step for us because of our VPN setup):

> sudo docker network create \
--subnet 10.11.0.0/16 \
--opt com.docker.network.bridge.name=docker_gwbridge \
--opt com.docker.network.bridge.enable_icc=false \
--opt com.docker.network.bridge.enable_ip_masquerade=true \
docker_gwbridge
output: https://pastebin.ubuntu.com/p/tvK892k6sv/

and finally on 18.06 core system:

> sudo docker swarm join --token SWMTKN-redacted 10.100.0.1:2377
This node joined a swarm as a worker.
syslog output: https://pastebin.ubuntu.com/p/BHXytfQxfJ/

now on the swarm manager:

> docker stack deploy -c compose.nginx.yml

Contents of compose.nginx.yml:

version: "3.5"

services:

  web:
    image: nginx
    ports:
     - "8080:80"
    environment:
     - NGINX_HOST=foobar.com
     - NGINX_PORT=80
    command: [nginx, '-g', 'daemon off;']

on 17.09/candidate this deploys successfully!

on the 18.06/stable core machine, this is the syslog output:

https://pastebin.ubuntu.com/p/98MGPB2XfV/

ijohnson · October 17, 2018, 3:23pm

I haven’t had a chance to reproduce this yet, but a couple of things to point out:

The apparmor messages about /bin/kmod can safely be ignored. They are expected: in the dockerd wrapper we have to go through some elaborate attempts to get the kernel to load some storage driver kernel modules for us, and an unfortunate side effect is that dockerd still tries to call /bin/kmod, which AppArmor denies. Since it happens pretty consistently, I will probably change the AppArmor rule so that the call is still denied but no longer logged, as the denial is normal.
I see you are using the overlay2 storage driver. Have you tried using aufs or overlay? I have had troubles with using overlay2 before, and the default in the snap is to use aufs.
I’m pretty sure the dockerd warning about the unknown runc output format is harmless and can be ignored.
As you originally mentioned, the error about /var/lib/docker being a read-only file system is confusing. I have looked everywhere in the dockerd source code for /var/lib/docker, and the only place it appears in actual code is as the default value for the --data-root command line option, which in the snap we explicitly set to $SNAP_COMMON/var-lib-docker. Is it possible that you have that value set somewhere in the scripts, etc. you are using? It’s also possible this value is coming from somewhere external to dockerd, but I’m still investigating that.
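
One way to double-check which data root a running dockerd was actually started with is to look at its command line. Below is a rough sketch, assuming the option is passed as --data-root on the command line as described above. The function name data_root_from_args is hypothetical, and the path in the demo is only an example value.

```shell
#!/bin/sh
# Hypothetical helper: print the --data-root value from a dockerd command line.
# Handles both "--data-root=PATH" and "--data-root PATH".
# Prints nothing and returns 1 if the option is absent.
# Usage: data_root_from_args "<full dockerd command line>"
data_root_from_args() {
    # Word splitting is intentional here; paths with spaces are not handled.
    set -- $1
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --data-root=*)
                printf '%s\n' "${1#--data-root=}"
                return 0
                ;;
            --data-root)
                [ "$#" -ge 2 ] && printf '%s\n' "$2"
                return 0
                ;;
        esac
        shift
    done
    return 1
}

# Demo with an example command line (not captured from a real system).
data_root_from_args 'dockerd --data-root=/var/snap/docker/common/var-lib-docker'
```

On a live system the argument could come from something like "$(ps -o args= -C dockerd)", assuming a procps-style ps is available. If the printed path is the snap's own directory, the /var/lib/docker in the error is probably not coming from the daemon's data root.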

miwagner1 · October 18, 2018, 3:26am

I have found that trying to deploy Traefik via docker reproduces the error="mkdir /var/lib/docker: read-only file system" issue 100% of the time for me.

cailen · October 22, 2018, 6:38pm

Ok!
I haven’t tried overlay; I will, and I’ll follow up. aufs doesn’t work. We opted for overlay2 over aufs because the official docker documentation lists a known kernel-crashing issue with aufs.
Ok!
In the reproducer there are no additional scripts, just the official nginx image and no build steps, so it seems a reasonable guess that it’s coming from something external to dockerd, probably something related to swarm:

    version: "3.5"

    services:

      web:
        image: nginx
        ports:
         - "8080:80"
        environment:
         - NGINX_HOST=foobar.com
         - NGINX_PORT=80
        command: [nginx, '-g', 'daemon off;']

cailen · October 29, 2018, 6:09pm

Hey @ijohnson any new insight?

Same issue presents with overlay

gquentin · October 31, 2018, 8:52am

So what is your solution? Snap design complicates things a lot, especially for docker, if we cannot mount paths from the host.

Hope we will not have to bind mount everything in media or home …

cailen · November 14, 2018, 7:39pm

Hey @ijohnson have you made any progress?

ijohnson · November 15, 2018, 12:28pm

No, I have not yet been able to determine the cause of this issue or any workaround.

tk83 · January 2, 2019, 12:34am

The snap doesn’t work at all for me when using the overlay2 storage driver. I’m a beginner at Docker, but then again it should just work even when installed from a snap. I have to use overlay2 since my filesystem is btrfs (aufs isn’t supported on btrfs, and Docker’s docs say only overlay2 or aufs is supported on Ubuntu).

sudo snap install docker

/var/snap/docker/current/config/daemon.json:
{
    "log-level":        "debug",
    "storage-driver":   "overlay2"
}

sudo systemctl stop snap.docker.dockerd
sudo systemctl start snap.docker.dockerd

$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Already exists
Digest: sha256:2557e3c07ed1e38f26e389462d03ed943586f744621577a99efb77324b0fe535
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused "rootfs_linux.go:109: jailing process inside rootfs caused \"permission denied\""": unknown.
ERRO[0004] error waiting for container: context canceled

ijohnson · January 14, 2019, 8:41pm

Hi,

Do you see any apparmor denials when you run into this problem? I.e. what does the following show:

journalctl --no-pager -e -k | grep apparmor | grep -v kmod | grep snap.docker.dockerd
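
If it is handy to reuse, the same chain of filters can be wrapped in a small function that reads log lines on stdin. This is just a convenience sketch; the name dockerd_denials is made up.

```shell
#!/bin/sh
# Hypothetical wrapper around the filters above: reads kernel log lines on
# stdin, keeps apparmor lines that mention snap.docker.dockerd, and drops
# the expected kmod ones.
dockerd_denials() {
    grep apparmor | grep -v kmod | grep snap.docker.dockerd
}

# Usage on a live system:
#   journalctl --no-pager -e -k | dockerd_denials
```

An empty result (grep exits non-zero) means no matching lines were found in the kernel log.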