Snap freezes after update to Ubuntu 22.04

lukav · August 30, 2022, 1:31pm

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

I’m not familiar with cgroup and haven’t debug them before so any help would be greatly appreciated. But I think you may be right since yesterday I was playing around with systemd-nspawn and I saw an error about some cgroup proxy service. I didn’t investigate since it didn’t block my experiments. Do you have any suggestions what to check?

No. Those are the full logs. If I cut the logs somewhere I put … so it is obvious.

mardy · August 30, 2022, 2:02pm

OK, this like is identical on my machine. I do also have another line about a cgroup v1 namespace, but I don’t think it can be related to this issue:

none on /sys/fs/cgroup/net_cls type cgroup (rw,relatime,net_cls)

Not really. Does the program terminate, or does it hang? Can you post a link to the output of sudo journalctl -b (after trying to start at least one snap, like hello-world)?

lukav · August 30, 2022, 2:07pm

It just hangs and it never finishes. I have to kill it with kill -KILL. That is also why I can not event install chromium since it has an install hook with also doesn’t finish. Here is the output after running hello-world:

Aug 30 10:05:16 Lukav-Beast systemd[4381]: Started snap.hello-world.hello-world.baa51e23-325c-4682-a3f4-f4c4c2f8753f.scope.
Aug 30 10:05:16 Lukav-Beast systemd[1]: tmp-snap.rootfs_g2TkNQ.mount: Deactivated successfully.

mardy · August 30, 2022, 2:31pm

OK, it looks like snap-update-ns makes some mistakes when freezing the cgroup, and ends up freezing itself. Can you please paste here the output of /proc/<PID>/cgroup, where PID is the process id of:

hello-world
snap-update-ns (you might have several of these ones running; try figuring out which one is the correct one, based on its parent PID, which should be the one of hello-world)

lukav · August 30, 2022, 2:41pm

They all have the same

$ ps auxf | grep snap-
root     1017352  0.8  0.0   3060  2224 pts/6    S+   10:37   0:00          |   |   \_ /usr/lib/snapd/snap-confine snap.hello-world.hello-world /usr/lib/snapd/snap-exec hello-world
root     1017373  0.0  0.0   3060   244 pts/6    S+   10:37   0:00          |   |       \_ /usr/lib/snapd/snap-confine snap.hello-world.hello-world /usr/lib/snapd/snap-exec hello-world
root     1017379  0.0  0.0 1224108 9392 pts/6    Sl+  10:37   0:00          |   |       \_ snap-update-ns --from-snap-confine hello-world

$ cat /proc/1017379/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope

$ cat /proc/1017373/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope

$ cat /proc/1017352/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope

mardy · August 30, 2022, 2:53pm

Thanks, and what does

ls -d /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.*

say?

mardy · August 30, 2022, 3:07pm

Also, what does

cat /proc/cmdline
systemctl --version

say?

lukav · August 30, 2022, 3:18pm

$ ls -d /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.*
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope

$ ls -la /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope/
total 0
drwxr-xr-x  2 lukav lukav 0 Aug 30 10:37 .
drwxr-xr-x 23 lukav lukav 0 Aug 30 11:14 ..
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.controllers
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.events
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.freeze
--w-------  1 lukav lukav 0 Aug 30 10:37 cgroup.kill
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.max.depth
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.max.descendants
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.procs
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.stat
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.subtree_control
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.threads
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.type
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cpu.pressure
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cpu.stat
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 io.pressure
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.current
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.events
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.events.local
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.high
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.low
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.max
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.min
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.numa_stat
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.oom.group
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.pressure
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.stat
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.current
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.events
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.high
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.max
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 pids.current
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 pids.events
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 pids.max

BOOT_IMAGE=/vmlinuz-5.15.0-46-generic root=[redacted] ro quiet splash vt.handoff=7

$ systemctl --version
systemd 249 (249.11-0ubuntu3.4)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP -LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

mardy · August 31, 2022, 8:56am

Thanks, it’s all exactly like on my machine. Yet, the line

2022/08/22 15:57:27.662141 tracking.go:148: DEBUG: systemd could not associate process 1244425 with transient scope snap.hello-world.hello-world.3b6af5b2-24e5-4660-93e8-32ee6785caf9.scope

from your logs says that there’s definitely something wrong with the cgroup. Can you please try running these commands and paste here the output?

sleep 3600 &
SLEEP_PID=$!
busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
    sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
    && cat /proc/$SLEEP_PID/cgroup
kill $SLEEP_PID

lukav · August 31, 2022, 1:57pm

Thank you @mardy, for all your effort. Here we go:

$ sleep 3600 &
SLEEP_PID=$!
[2] 1423597

$ echo ${SLEEP_PID}
1423597

$ busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
    sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
    && cat /proc/$SLEEP_PID/cgroup
o "/org/freedesktop/systemd1/job/20283"
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/sleeping.scope

$ kill $SLEEP_PID
[2]+  Terminated              sleep 3600

mardy · August 31, 2022, 2:58pm

I’m running out of ideas

If you re-run the same commands a few times, do you always get the same result (especially the 0:: line)?

Also, can you paste the output of statfs /sys/fs/cgroup/?

Last but not least, can you confirm that running

echo 0 > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.<snap-name>.*/cgroup.freeze

allows the application to start?

lukav · August 31, 2022, 3:26pm

I’ve ran it 4 times only once the output was just:

0::/user.slice/user-1000.slice/session-c4.scope

instead of

0::/user.slice/user-1000.slice/user@1000.service/app.slice/sleeping.scope

But I guess that busctl just started too early before the sleep has actually started sleeping.

$ statfs /sys/fs/cgroup/
{
	"Type": 1667723888,
	"Bsize": 4096,
	"Blocks": 0,
	"Bfree": 0,
	"Bavail": 0,
	"Files": 0,
	"Ffree": 0,
	"Fsid": {
		"Val": [
			0,
			0
		]
	},
	"Namelen": 255,
	"Frsize": 4096,
	"Flags": 4142,
	"Spare": [
		0,
		0,
		0,
		0
	]
}

mardy:

Last but not least, can you confirm that running
echo 0 > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.<snap-name>.*/cgroup.freeze
allows the application to start?

The application did not start, but at least the process finished. It showed:

cannot update snap namespace: cannot finish freezing processes of snap "hello-world": cannot freeze processes of snap "hello-world" in group snap.hello-world.hello-world.71c2af7b-65c4-4592-a6dd-2ca030af4e2e.scope
snap-update-ns failed with code 1

mardy · August 31, 2022, 3:46pm

lukav:

I’ve ran it 4 times only once the output was just:
0::/user.slice/user-1000.slice/session-c4.scope
instead of
0::/user.slice/user-1000.slice/user@1000.service/app.slice/sleeping.scope
But I guess that busctl just started too early before the sleep has actually started sleeping.

I’m not convinced; I think that the shell would not move to execute the next command until it has gotten the PID of the background process. Can you please try running this script, then (you can copy&paste this into test-cgroup.sh):

#! /bin/bash

set -e

while true
do
    sleep 3600 &
    sleep 0.1
    SLEEP_PID=$!
    busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
        sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
        && grep -q sleeping /proc/$SLEEP_PID/cgroup
    kill $SLEEP_PID
done

lukav · August 31, 2022, 4:12pm

I ran it a few times. it outputs a different number of rows like this:

$ ./test-cgroup.sh 
o "/org/freedesktop/systemd1/job/21373"
o "/org/freedesktop/systemd1/job/21377"
o "/org/freedesktop/systemd1/job/21381"
o "/org/freedesktop/systemd1/job/21385"
o "/org/freedesktop/systemd1/job/21389"

then it stops on its own and a sleep 3600 process stays active. If I don’t kill it on next run I get only:

$ ./test-cgroup.sh 
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.

mardy · August 31, 2022, 7:02pm

lukav:

I ran it a few times. it outputs a different number of rows like this:
$ ./test-cgroup.sh 
o "/org/freedesktop/systemd1/job/21373"
o "/org/freedesktop/systemd1/job/21377"
o "/org/freedesktop/systemd1/job/21381"
o "/org/freedesktop/systemd1/job/21385"
o "/org/freedesktop/systemd1/job/21389"
then it stops on its own and a sleep 3600 process stays active.

After how many lines does it stop? Actually, can you try this version of the script:

#! /bin/bash

set -e

while true
do
    sleep 3600 &
    sleep 0.1
    SLEEP_PID=$!
    busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
        sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
        && sleep 0.1 \
        && grep -q sleeping /proc/$SLEEP_PID/cgroup
    kill $SLEEP_PID
done

This one should never stop. If it stops, it means that some assumptions we make in snapd are wrong, and we should increase the timeout for our operation.

On which machine are you seeing this problem? Just wondering if it could be especially slow.

lukav · August 31, 2022, 8:53pm

Confirmed. This one never stops.

...
o "/org/freedesktop/systemd1/job/24385"
o "/org/freedesktop/systemd1/job/24389"
o "/org/freedesktop/systemd1/job/24393"
o "/org/freedesktop/systemd1/job/24397"
o "/org/freedesktop/systemd1/job/24401"
...

Info: 12-core model: AMD Ryzen Threadripper 1920X
Memory: 62.65 GiB used: 21.62 GiB (34.5%)
The OS disks are 2 SSD 970 EVO 500GB in RAID1

mardy · September 1, 2022, 5:44pm

Maybe then your machine is too fast

Can you please check, if the line

systemd could not associate process <PID> with transient scope snap.*.scope

is printed everytime you start a snap, or only sometimes? I’m a bit puzzled, because the probability of you seeing this line should be (unless we have some error in our code, of course) be about the same as having that script fail.

lukav · September 1, 2022, 8:55pm

From 10 tries I got it 10 times. I tried starting hello-world with nice -n 19 and ionice -c 3, but the result is the same

mardy · September 2, 2022, 6:25am

Then it sounds like we might be doing something wrong in snapd. Would you be so kind to try our this custom build of snapd (you need to install it using the --dangerous option), then run the hello-world snap again with SNAPD_DEBUG set to 1 and paste the logs?

(there might be about a couple of hundreds of lines more, so better paste it into a pastebin and only add a link here)

lukav · September 2, 2022, 2:06pm

I hope I did it right: https://paste.ubuntu.com/p/Y76NZhDyRP/

I used the echo 0 > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.*/cgroup.freeze to finish the process