The Snap daemon service cannot be restarted after an abort on an AWS Cloud instance

issacmgongora · January 18, 2024, 7:43pm

Hello everyone,

I created this topic after I didn’t find any similar topics about this matter.

I have an instance in an AWS Cloud account running this OS version:

# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy

And this version of snap

snap --version

snap 2.61.1
snapd 2.61.1
series 16
ubuntu 22.04
kernel 6.2.0-1017-aws

Also this configuration of the snapd service

[Unit]
Description=Snap Daemon
After=snapd.socket
After=time-set.target
After=snapd.mounts.target
Wants=time-set.target
Wants=snapd.mounts.target
Requires=snapd.socket
OnFailure=snapd.failure.service
# This is handled by snapd
# X-Snapd-Snap: do-not-start

[Service]
# Disabled because it breaks lxd
# (Bug #1709536 “snapd 2.26.14 on ubuntu-core won't start in contai...” : Bugs : snapd)
#Nice=-5
OOMScoreAdjust=-900
ExecStart=/usr/lib/snapd/snapd
EnvironmentFile=-/etc/environment
Restart=always
WatchdogSec=5m
Type=notify
SuccessExitStatus=42
RestartPreventExitStatus=42
KillMode=process

[Install]
WantedBy=multi-user.target

On this instance, the snap daemon is unable to start again after the ‘snapd’ process is aborted.

Jan 18 00:57:44 ip-10-0-21-221 systemd[1]: snapd.service: Watchdog timeout (limit 5min)!
Jan 18 00:57:46 ip-10-0-21-221 systemd[1]: snapd.service: Killing process 448 (snapd) with signal SIGABRT.
Jan 18 00:59:21 ip-10-0-21-221 snapd[448]: SIGABRT: abort
Jan 18 00:59:21 ip-10-0-21-221 snapd[448]: PC=0xaaaad25ca41c m=0 sigcode=0
Jan 18 00:59:21 ip-10-0-21-221 snapd[448]: goroutine 0 [idle]:

Even though systemd tries to start it again about 98 times, the service never comes back up.

Jan 18 00:59:19 ip-10-0-21-221 systemd[1]: snapd.service: State 'stop-watchdog' timed out. Killing.
Jan 18 00:59:20 ip-10-0-21-221 systemd[1]: snapd.service: Killing process 448 (snapd) with signal SIGKILL.
Jan 18 00:59:21 ip-10-0-21-221 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Jan 18 00:59:21 ip-10-0-21-221 systemd[1]: snapd.service: Failed with result 'watchdog'.
Jan 18 00:59:22 ip-10-0-21-221 systemd[1]: snapd.service: Consumed 52.492s CPU time.
Jan 18 00:59:25 ip-10-0-21-221 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 1.
Jan 18 00:59:27 ip-10-0-21-221 systemd[1]: Stopped Snap Daemon.
Jan 18 00:59:27 ip-10-0-21-221 systemd[1]: snapd.service: Consumed 52.492s CPU time.
Jan 18 00:59:59 ip-10-0-21-221 systemd[1]: Starting Snap Daemon...
Jan 18 01:01:00 ip-10-0-21-221 systemd[1]: snapd.service: start operation timed out. Terminating.
Jan 18 01:01:00 ip-10-0-21-221 systemd[1]: snapd.service: Failed with result 'timeout'.
Jan 18 01:01:00 ip-10-0-21-221 systemd[1]: Failed to start Snap Daemon.
Jan 18 01:01:00 ip-10-0-21-221 systemd[1]: snapd.service: Consumed 1.195s CPU time.
Jan 18 01:01:01 ip-10-0-21-221 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 2.
. . .
Jan 18 06:38:39 ip-10-0-21-221 systemd[1]: Starting Snap Daemon...
Jan 18 06:48:12 ip-10-0-21-221 systemd[1]: snapd.service: start operation timed out. Terminating.
Jan 18 06:49:01 ip-10-0-21-221 systemd[1]: snapd.service: Failed with result 'timeout'.
Jan 18 06:49:08 ip-10-0-21-221 systemd[1]: Failed to start Snap Daemon.
Jan 18 06:49:48 ip-10-0-21-221 systemd[1]: snapd.service: Consumed 3.188s CPU time.
Jan 18 06:52:29 ip-10-0-21-221 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 97.
Jan 18 06:56:10 ip-10-0-21-221 systemd[1]: Stopped Snap Daemon.
Jan 18 06:56:15 ip-10-0-21-221 systemd[1]: snapd.service: Consumed 3.188s CPU time.
Jan 18 06:59:11 ip-10-0-21-221 systemd[1]: Starting Snap Daemon...
Jan 18 07:18:14 ip-10-0-21-221 systemd[1]: snapd.service: start operation timed out. Terminating.
Jan 18 07:23:24 ip-10-0-21-221 systemd[1]: snapd.service: Failed with result 'timeout'.
Jan 18 07:23:50 ip-10-0-21-221 systemd[1]: Failed to start Snap Daemon.
Jan 18 07:44:05 ip-10-0-21-221 systemd[1]: snapd.service: Scheduled restart job, restart counter is at 98.

Unfortunately, the event occurred outside office hours, and when we tried to access the instance about 18 hours later, it was no longer reachable. The only thing we could do was restart the instance.

The primary issue is that we are unable to pinpoint the cause of the problem and fix it. This is the second time this has happened, and the processes running on the instance are affected by it.

Could you help us?

mborzecki1 · January 19, 2024, 6:54am

Do you have more of the log that includes the SIGABRT? There should be stack traces right below the line you pasted.
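If the journal has been saved to a file (for example with journalctl -u snapd.service > snapd.log), the trace that follows the abort can be pulled out with a small helper. This is only a sketch: the function name show_abort_trace, the file name snapd.log and the default of 200 trailing lines are illustrative assumptions, not anything snapd provides.

```shell
# Print the "SIGABRT: abort" line and the N lines that follow it
# from a saved log file (N defaults to 200, an arbitrary choice).
show_abort_trace() {
    log_file="$1"
    lines_after="${2:-200}"
    grep -A "$lines_after" 'SIGABRT: abort' "$log_file"
}

# Example use on the affected host (hypothetical file name):
#   journalctl -u snapd.service > snapd.log
#   show_abort_trace snapd.log
```

The goroutine dump printed by the Go runtime can be long, so a generous line count is safer than a small one.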

Snapd is normally a bit chatty when starting, but I don’t see any of that in the log for some reason. You’d at least see a line like:

sty 19 07:48:45 galeon snapd[766673]: daemon.go:247: started snapd/2.56.2 (series 16; classic; devmode; testing) arch/ (amd64) linux/6.7.0-arch3-1.

Next time this happens, check the state of the socket unit: systemctl status snapd.socket. You may also temporarily enable debug logs in snapd until you observe the issue again. This can be done by adding SNAPD_DEBUG=1 to the snapd service environment. You can either add it to /etc/environment (the file your unit loads through EnvironmentFile=) or run systemctl edit snapd.service and add it to the override file.
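The override route can be sketched roughly as follows. The function name write_snapd_debug_dropin is made up for illustration, and the default path assumes the standard drop-in location that systemctl edit normally uses for snapd.service. Treat it as a starting point rather than the definitive procedure.

```shell
# Write a systemd drop-in that sets SNAPD_DEBUG=1 for snapd.service.
# The optional first argument overrides the drop-in directory; the
# default is the location systemctl edit would normally write to.
write_snapd_debug_dropin() {
    dropin_dir="${1:-/etc/systemd/system/snapd.service.d}"
    mkdir -p "$dropin_dir"
    printf '[Service]\nEnvironment=SNAPD_DEBUG=1\n' > "$dropin_dir/override.conf"
}

# On the affected host, as root, one would then run:
#   write_snapd_debug_dropin
#   systemctl daemon-reload
#   systemctl restart snapd.service
#   systemctl status snapd.socket
#   journalctl -u snapd.service -f
```

Once the issue has been captured, removing override.conf and running systemctl daemon-reload again should return snapd to its normal log level.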