Docker fails to restart after power failure

alok · July 27, 2020, 5:10pm

We have observed that dockerd fails to start after power failure. It gets into a restart loop:

$ snap logs docker
2020-07-27T16:57:59Z systemd[1]: snap.docker.dockerd.service: Scheduled restart job, restart counter is at 17690.
2020-07-27T16:57:59Z systemd[1]: Stopped Service for snap application docker.dockerd.
2020-07-27T16:57:59Z systemd[1]: Started Service for snap application docker.dockerd.
2020-07-27T16:58:14Z docker.dockerd[14461]: failed to start containerd: timeout waiting for containerd to start
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Main process exited, code=exited, status=1/FAILURE
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Failed with result 'exit-code'.
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Service hold-off time over, scheduling restart.

From journalctl:

$ journalctl -u snap.docker.dockerd.service | grep containerd-shim
Jul 21 17:08:28 1stzoom-edge docker.dockerd[876]: /containerd-shim/moby/bb15eb99d37b96066522c5a13af5e3443d886751b686e9f2421de667f00b5b4b/shim.sock: connect: connection refused" id=bb15eb99d37b96066522c5a13af5e3443d886751b686e9f2421de667f00b5b4b namespace=moby

Docker snap version:

$ snap info docker
installed:          19.03.11                (471) 131MB -

Do we track docker-snap issues somewhere? I have reported a few issues in the past here, but they seem to get ignored after some time.

ijohnson · July 28, 2020, 6:20pm

do you have any system journal denials? journalctl -e --no-pager | grep DENIED

alok · July 28, 2020, 7:17pm

Thanks @ijohnson for your response. I did not find any instance of denials.

ijohnson · July 28, 2020, 9:57pm

are you able to change the daemon.json for docker at /var/snap/docker/current/config/daemon.json to include debug logging as well?

alok · July 28, 2020, 11:07pm

Thanks for the hint. I enabled debugging and found the following two lines in the log:

Jul 29 04:25:57 1stzoom-edge docker.dockerd[26033]: time="2020-07-29T04:25:57.466730064+05:30" level=debug msg="Containerd not running, starting daemon managed containerd"
Jul 29 04:25:57 1stzoom-edge docker.dockerd[26033]: time="2020-07-29T04:25:57.466900881+05:30" level=info msg="libcontainerd: containerd is still running" pid=1330

And sure enough there was a /var/snap/docker/current/run/docker/containerd/containerd.pid. After I manually deleted the pid file, dockerd was able to start fine.

Is /var/snap/docker/current/run persistent?

lucyllewy · July 28, 2020, 11:44pm

Yes, files under /var/snap/$SNAPNAME/current and .../common are persistent while the snap is installed. They will be deleted when you remove a snap (and saved for easy restoring. You can supply the --purge flag to snap remove to permanently delete the files or use snap saved and snap forget to drop them after the fact if you didn’t purge. they’ll also drop automatically after a period of days if untouched).

Perhaps it would be a good idea for the runtime environment set-up by snapd to include a tmpfs or portion of the real /run for /run inside the snap’s worldview. @popey, can you poke the right folk to consider this (I forget who the right people are, though @zyga-snapd comes to mind)

alok · July 29, 2020, 12:15am

Hmm… A persistent run directory seems flaky to me. It will break apps that assume a tmpfs-mounted run directory. I think this is the main issue here - containerd does not cleanup the pid file; it assumes that the cleanup will happen automatically during reboot (not sure how it handles crashes).

Out of curiosity, docker-snap can already write to /var/run/docker.sock and /var/run/docker/, so why do we need a run directory under /var/snap/?

ijohnson · July 29, 2020, 12:45pm

This sounds more like an oversight in the docker snap’s setup, if it is storing things in $SNAP_DATA/run that are meant to be temporary like this, as @alok says it’s doing the wrong thing. Probably an easy fix for this would be to adjust $SNAP_DATA/run to be a symlink to /run/snap.${SNAP_INSTANCE_NAME} instead.

All snaps should already have access to /run/snap.@{SNAP_INSTANCE_NAME}:

$ grep -P '/run/snap\.' /var/lib/snapd/apparmor/profiles/snap.hello-world.hello-world 
  /run/snap.@{SNAP_INSTANCE_NAME}/ rw,
  /run/snap.@{SNAP_INSTANCE_NAME}/** mrwklix,

alok · July 29, 2020, 10:08pm

@ijohnson @tianon Is there a bug tracker for docker-snap I can use to report this issue?

alok · July 30, 2020, 10:49pm

I think this is because of this piece of code:

# modify XDG_RUNTIME_DIR to be a snap writable dir underneath $SNAP_COMMON 
# until LP #1656340 is fixed
export XDG_RUNTIME_DIR=$SNAP_COMMON/run

@ijohnson @tianon what is LP #1656340 and is it still an issue?

alok · July 30, 2020, 11:05pm

There are other problematic settings:

exec dockerd \
	--exec-root="$SNAP_DATA/run/docker" \
	--pidfile="$SNAP_DATA/run/docker.pid" \

The default setting for exec-root is /var/run/docker and pidfile is var/run/docker.pid. Using persistent directories for them might be problematic.

ijohnson · August 3, 2020, 1:44pm

Right we probably want @tianon to adjust that in a safe way that doesn’t break existing installs. Is there anything in --exec-root which is expected to be persistent? If not we could probably just switch it over directly in the cmdline of dockerd

alok · August 19, 2020, 7:09pm

The current docker-snap is almost unusable in edge environments with inconsistent power supply. It gets into a bad state pretty fast. Even restarting the docker daemon does not fix it. The only fix I have been successful with is to purge docker snap and reinstall. So docker-snap is not a production-grade implementation that we can rely on. We have decided to move away from using docker snap, which also meant moving away from Ubuntu core. We will re-evaluate in future if the maintenance of docker-snap improves. I am putting this here in case someone is evaluating using docker-snap in production environment.

ijohnson · October 1, 2020, 3:51pm

@alok the docker snap is now being more actively maintained at github.com/docker-snap/docker-snap, so could you file an issue there about moving --exec-root to /run/snap.$SNAP_INSTANCE_NAME there? Thanks

lucyllewy · October 8, 2020, 4:01pm

I’ve made a PR to hopefully address this problem: https://github.com/docker-snap/docker-snap/pull/9