We have observed that dockerd fails to start after power failure. It gets into a restart loop:
$ snap logs docker
2020-07-27T16:57:59Z systemd[1]: snap.docker.dockerd.service: Scheduled restart job, restart counter is at 17690.
2020-07-27T16:57:59Z systemd[1]: Stopped Service for snap application docker.dockerd.
2020-07-27T16:57:59Z systemd[1]: Started Service for snap application docker.dockerd.
2020-07-27T16:58:14Z docker.dockerd[14461]: failed to start containerd: timeout waiting for containerd to start
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Main process exited, code=exited, status=1/FAILURE
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Failed with result 'exit-code'.
2020-07-27T16:58:14Z systemd[1]: snap.docker.dockerd.service: Service hold-off time over, scheduling restart.
Thanks for the hint. I enabled debugging and found the following two lines in the log:
Jul 29 04:25:57 1stzoom-edge docker.dockerd[26033]: time="2020-07-29T04:25:57.466730064+05:30" level=debug msg="Containerd not running, starting daemon managed containerd"
Jul 29 04:25:57 1stzoom-edge docker.dockerd[26033]: time="2020-07-29T04:25:57.466900881+05:30" level=info msg="libcontainerd: containerd is still running" pid=1330
And sure enough there was a /var/snap/docker/current/run/docker/containerd/containerd.pid. After I manually deleted the pid file, dockerd was able to start fine.
Yes, files under /var/snap/$SNAPNAME/current and .../common are persistent while the snap is installed. They will be deleted when you remove a snap (and saved for easy restoring. You can supply the --purge flag to snap remove to permanently delete the files or use snap saved and snap forget to drop them after the fact if you didn’t purge. they’ll also drop automatically after a period of days if untouched).
Perhaps it would be a good idea for the runtime environment set-up by snapd to include a tmpfs or portion of the real /run for /run inside the snap’s worldview. @popey, can you poke the right folk to consider this (I forget who the right people are, though @zyga-snapd comes to mind)
Hmm… A persistent run directory seems flaky to me. It will break apps that assume a tmpfs-mounted run directory. I think this is the main issue here - containerd does not cleanup the pid file; it assumes that the cleanup will happen automatically during reboot (not sure how it handles crashes).
Out of curiosity, docker-snap can already write to /var/run/docker.sock and /var/run/docker/, so why do we need a run directory under /var/snap/?
This sounds more like an oversight in the docker snap’s setup, if it is storing things in $SNAP_DATA/run that are meant to be temporary like this, as @alok says it’s doing the wrong thing. Probably an easy fix for this would be to adjust $SNAP_DATA/run to be a symlink to /run/snap.${SNAP_INSTANCE_NAME} instead.
All snaps should already have access to /run/snap.@{SNAP_INSTANCE_NAME}:
Right we probably want @tianon to adjust that in a safe way that doesn’t break existing installs. Is there anything in --exec-root which is expected to be persistent? If not we could probably just switch it over directly in the cmdline of dockerd
The current docker-snap is almost unusable in edge environments with inconsistent power supply. It gets into a bad state pretty fast. Even restarting the docker daemon does not fix it. The only fix I have been successful with is to purge docker snap and reinstall. So docker-snap is not a production-grade implementation that we can rely on. We have decided to move away from using docker snap, which also meant moving away from Ubuntu core. We will re-evaluate in future if the maintenance of docker-snap improves. I am putting this here in case someone is evaluating using docker-snap in production environment.
@alok the docker snap is now being more actively maintained at github.com/docker-snap/docker-snap, so could you file an issue there about moving --exec-root to /run/snap.$SNAP_INSTANCE_NAME there? Thanks