Docker fails to restart after power failure

We have observed that dockerd fails to start after power failure. It gets into a restart loop:

$ snap logs docker
2020-07-27T16:57:59Z systemd[1]: snap.docker.dockerd.service: Scheduled restart job, restart counter is at 17690.
2020-07-27T16:57:59Z systemd[1]: Stopped Service for snap application docker.dockerd.
2020-07-27T16:57:59Z systemd[1]: Started Service for snap application docker.dockerd.
2020-07-27T16:58:14Z docker.dockerd[14461]: failed to start containerd: timeout waiting for containerd to start
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Main process exited, code=exited, status=1/FAILURE
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Failed with result 'exit-code'.
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Service hold-off time over, scheduling restart.

From journalctl:

$ journalctl -u snap.docker.dockerd.service | grep containerd-shim
Jul 21 17:08:28 1stzoom-edge docker.dockerd[876]: /containerd-shim/moby/bb15eb99d37b96066522c5a13af5e3443d886751b686e9f2421de667f00b5b4b/shim.sock: connect: connection refused" id=bb15eb99d37b96066522c5a13af5e3443d886751b686e9f2421de667f00b5b4b namespace=moby

Docker snap version:

$ snap info docker
installed:          19.03.11                (471) 131MB -

Do we track docker-snap issues somewhere? I have reported a few issues in the past here, but they seem to get ignored after some time.

do you have any system journal denials? journalctl -e --no-pager | grep DENIED

Thanks @ijohnson for your response. I did not find any instance of denials.

are you able to change the daemon.json for docker at /var/snap/docker/current/config/daemon.json to include debug logging as well?

Thanks for the hint. I enabled debugging and found the following two lines in the log:

Jul 29 04:25:57 1stzoom-edge docker.dockerd[26033]: time="2020-07-29T04:25:57.466730064+05:30" level=debug msg="Containerd not running, starting daemon managed containerd"
Jul 29 04:25:57 1stzoom-edge docker.dockerd[26033]: time="2020-07-29T04:25:57.466900881+05:30" level=info msg="libcontainerd: containerd is still running" pid=1330

And sure enough there was a /var/snap/docker/current/run/docker/containerd/ After I manually deleted the pid file, dockerd was able to start fine.

Is /var/snap/docker/current/run persistent?

Yes, files under /var/snap/$SNAPNAME/current and .../common are persistent while the snap is installed. They will be deleted when you remove a snap (and saved for easy restoring. You can supply the --purge flag to snap remove to permanently delete the files or use snap saved and snap forget to drop them after the fact if you didn’t purge. they’ll also drop automatically after a period of days if untouched).

Perhaps it would be a good idea for the runtime environment set-up by snapd to include a tmpfs or portion of the real /run for /run inside the snap’s worldview. @popey, can you poke the right folk to consider this (I forget who the right people are, though @zyga comes to mind)

Hmm… A persistent run directory seems flaky to me. It will break apps that assume a tmpfs-mounted run directory. I think this is the main issue here - containerd does not cleanup the pid file; it assumes that the cleanup will happen automatically during reboot (not sure how it handles crashes).

Out of curiosity, docker-snap can already write to /var/run/docker.sock and /var/run/docker/, so why do we need a run directory under /var/snap/?

This sounds more like an oversight in the docker snap’s setup, if it is storing things in $SNAP_DATA/run that are meant to be temporary like this, as @alok says it’s doing the wrong thing. Probably an easy fix for this would be to adjust $SNAP_DATA/run to be a symlink to /run/snap.${SNAP_INSTANCE_NAME} instead.

All snaps should already have access to /run/snap.@{SNAP_INSTANCE_NAME}:

$ grep -P '/run/snap\.' /var/lib/snapd/apparmor/profiles/snap.hello-world.hello-world 
  /run/snap.@{SNAP_INSTANCE_NAME}/ rw,
  /run/snap.@{SNAP_INSTANCE_NAME}/** mrwklix,
1 Like

@ijohnson @tianon Is there a bug tracker for docker-snap I can use to report this issue?

I think this is because of this piece of code:

# modify XDG_RUNTIME_DIR to be a snap writable dir underneath $SNAP_COMMON 
# until LP #1656340 is fixed

@ijohnson @tianon what is LP #1656340 and is it still an issue?

There are other problematic settings:

exec dockerd \
	--exec-root="$SNAP_DATA/run/docker" \
	--pidfile="$SNAP_DATA/run/" \

The default setting for exec-root is /var/run/docker and pidfile is var/run/ Using persistent directories for them might be problematic.

Right we probably want @tianon to adjust that in a safe way that doesn’t break existing installs. Is there anything in --exec-root which is expected to be persistent? If not we could probably just switch it over directly in the cmdline of dockerd

The current docker-snap is almost unusable in edge environments with inconsistent power supply. It gets into a bad state pretty fast. Even restarting the docker daemon does not fix it. The only fix I have been successful with is to purge docker snap and reinstall. So docker-snap is not a production-grade implementation that we can rely on. We have decided to move away from using docker snap, which also meant moving away from Ubuntu core. We will re-evaluate in future if the maintenance of docker-snap improves. I am putting this here in case someone is evaluating using docker-snap in production environment.