Snap freezes after update to Ubuntu 22.04

Hi lukav,

I am absolutely no expert, but to me this sounds like a pure update problem. Whatever it was, something did not properly upgrade to 22.04. In your situation, I would either:

  1. Start from scratch with a clean 22.04. If you have a separate partition for your home folder, this should be a piece of cake. This is the easy way out.

  2. If you want to get to the bottom of this: compare your filesystem (in particular the AppArmor profiles and configs) with a clean 22.04. List your installed packages with sudo apt list --installed, install the same set on a clean 22.04 VM, then compare the two filesystems, excluding home, using e.g. rdiff. A rough sketch follows below.
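
For instance (just a sketch; /mnt/clean is a placeholder for wherever you make the clean VM’s filesystem visible):

# on the broken machine, capture the package list
sudo apt list --installed > packages-broken.txt

# install the same set on the clean 22.04 VM, then compare e.g. the
# AppArmor profiles against the clean filesystem
sudo diff -r /etc/apparmor.d /mnt/clean/etc/apparmor.d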

I know the first one sounds like Microsoft Windows advice, but it depends of course on how critical your current situation is.

Kind regards,

Brecht

Hi @lukav, can you please print the output of

mount | grep cgroup

I wonder if there could be something wrong with your cgroups setup, since the logs you pasted stop exactly at the point where cgroups should be used (or did you maybe cut off the logs earlier by mistake?).

Hi @mardy,

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

I’m not familiar with cgroups and haven’t debugged them before, so any help would be greatly appreciated. But I think you may be right: yesterday I was playing around with systemd-nspawn and saw an error about some cgroup proxy service. I didn’t investigate it, since it didn’t block my experiments. Do you have any suggestions on what to check?

No. Those are the full logs. If I cut the logs somewhere, I put … so it is obvious.

OK, this line is identical on my machine. I also have another line about a cgroup v1 namespace, but I don’t think it can be related to this issue:

none on /sys/fs/cgroup/net_cls type cgroup (rw,relatime,net_cls)

Not really. Does the program terminate, or does it hang? Can you post a link to the output of sudo journalctl -b (after trying to start at least one snap, like hello-world)?

It just hangs and never finishes; I have to kill it with kill -KILL. That is also why I cannot even install chromium, since it has an install hook which also never finishes. Here is the output after running hello-world:

Aug 30 10:05:16 Lukav-Beast systemd[4381]: Started snap.hello-world.hello-world.baa51e23-325c-4682-a3f4-f4c4c2f8753f.scope.
Aug 30 10:05:16 Lukav-Beast systemd[1]: tmp-snap.rootfs_g2TkNQ.mount: Deactivated successfully.

OK, it looks like snap-update-ns makes some mistakes when freezing the cgroup and ends up freezing itself. Can you please paste here the contents of /proc/<PID>/cgroup, where PID is the process ID of:

  • hello-world
  • snap-update-ns (you might have several of these running; try to figure out which one is the right one based on its parent PID, which should be that of hello-world). A quick way to find the PIDs is sketched below.
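
For example, while hello-world is hanging, something like this should locate them (plain ps/grep, nothing snap-specific):

ps auxf | grep -E 'snap-confine|snap-update-ns'
cat /proc/<PID>/cgroup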

They all show the same cgroup:

$ ps auxf | grep snap-
root     1017352  0.8  0.0   3060  2224 pts/6    S+   10:37   0:00          |   |   \_ /usr/lib/snapd/snap-confine snap.hello-world.hello-world /usr/lib/snapd/snap-exec hello-world
root     1017373  0.0  0.0   3060   244 pts/6    S+   10:37   0:00          |   |       \_ /usr/lib/snapd/snap-confine snap.hello-world.hello-world /usr/lib/snapd/snap-exec hello-world
root     1017379  0.0  0.0 1224108 9392 pts/6    Sl+  10:37   0:00          |   |       \_ snap-update-ns --from-snap-confine hello-world
$ cat /proc/1017379/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope
$ cat /proc/1017373/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope
$ cat /proc/1017352/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope

Thanks, and what does

ls -d /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.*

say?

Also, what does

cat /proc/cmdline
systemctl --version

say?

$ ls -d /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.*
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope
$ ls -la /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.ce156e19-97a9-45e4-8fe3-7190dfb768ba.scope/
total 0
drwxr-xr-x  2 lukav lukav 0 Aug 30 10:37 .
drwxr-xr-x 23 lukav lukav 0 Aug 30 11:14 ..
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.controllers
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.events
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.freeze
--w-------  1 lukav lukav 0 Aug 30 10:37 cgroup.kill
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.max.depth
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.max.descendants
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.procs
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.stat
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.subtree_control
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.threads
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cgroup.type
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 cpu.pressure
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 cpu.stat
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 io.pressure
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.current
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.events
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.events.local
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.high
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.low
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.max
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.min
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.numa_stat
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.oom.group
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.pressure
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.stat
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.current
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.events
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.high
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 memory.swap.max
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 pids.current
-r--r--r--  1 lukav lukav 0 Aug 30 10:37 pids.events
-rw-r--r--  1 lukav lukav 0 Aug 30 10:37 pids.max
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-46-generic root=[redacted] ro quiet splash vt.handoff=7
$ systemctl --version
systemd 249 (249.11-0ubuntu3.4)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP -LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

Thanks, it’s all exactly like on my machine. Yet, the line

2022/08/22 15:57:27.662141 tracking.go:148: DEBUG: systemd could not associate process 1244425 with transient scope snap.hello-world.hello-world.3b6af5b2-24e5-4660-93e8-32ee6785caf9.scope

from your logs says that there’s definitely something wrong with the cgroup. Can you please try running these commands and paste here the output?

sleep 3600 &
SLEEP_PID=$!
busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
    sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
    && cat /proc/$SLEEP_PID/cgroup
kill $SLEEP_PID
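
For reference, the arguments map onto StartTransientUnit’s D-Bus signature 'ssa(sv)a(sa(sv))' like this:

# s          sleeping.scope          the unit name
# s          fail                    the mode (fail if the unit already exists)
# a(sv)      1 PIDs au 1 $SLEEP_PID  one property, "PIDs": an array of
#                                    uint32 with a single element, our PID
# a(sa(sv))  0                       no auxiliary units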

Thank you @mardy, for all your effort. Here we go:

$ sleep 3600 &
SLEEP_PID=$!
[2] 1423597

$ echo ${SLEEP_PID}
1423597

$ busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
    sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
    && cat /proc/$SLEEP_PID/cgroup
o "/org/freedesktop/systemd1/job/20283"
1:name=systemd:/
0::/user.slice/user-1000.slice/user@1000.service/app.slice/sleeping.scope

$ kill $SLEEP_PID
[2]+  Terminated              sleep 3600

I’m running out of ideas :frowning:

If you re-run the same commands a few times, do you always get the same result (especially the 0:: line)?

Also, can you paste the output of statfs /sys/fs/cgroup/?

Last but not least, can you confirm that running

echo 0 > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.<snap-name>.*/cgroup.freeze

allows the application to start?
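
With the hello-world scope from your earlier output that would be something like the following (the UUID changes on every run, hence the glob; note that bash reports an ambiguous redirect if the glob matches more than one file):

echo 0 > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/snap.hello-world.hello-world.*/cgroup.freeze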

:frowning:

I’ve run it 4 times; only once was the output just:

0::/user.slice/user-1000.slice/session-c4.scope

instead of

0::/user.slice/user-1000.slice/user@1000.service/app.slice/sleeping.scope

But I guess that busctl simply ran too early, before the sleep had actually started sleeping.

$ statfs /sys/fs/cgroup/
{
	"Type": 1667723888,
	"Bsize": 4096,
	"Blocks": 0,
	"Bfree": 0,
	"Bavail": 0,
	"Files": 0,
	"Ffree": 0,
	"Fsid": {
		"Val": [
			0,
			0
		]
	},
	"Namelen": 255,
	"Frsize": 4096,
	"Flags": 4142,
	"Spare": [
		0,
		0,
		0,
		0
	]
}

The application did not start, but at least the process finished. It showed:

cannot update snap namespace: cannot finish freezing processes of snap "hello-world": cannot freeze processes of snap "hello-world" in group snap.hello-world.hello-world.71c2af7b-65c4-4592-a6dd-2ca030af4e2e.scope
snap-update-ns failed with code 1

I’m not convinced; I think that the shell would not move on to the next command until it has obtained the PID of the background process. Can you please try running this script, then (you can copy & paste it into test-cgroup.sh):

#! /bin/bash

set -e
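# note: because of set -e, the script exits as soon as any command fails,
# in particular the grep below; in that case the kill is never reached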

while true
do
    sleep 3600 &
    sleep 0.1
    SLEEP_PID=$!
    busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
        sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
        && grep -q sleeping /proc/$SLEEP_PID/cgroup
    kill $SLEEP_PID
done

I ran it a few times. It outputs a different number of rows each time, like this:

$ ./test-cgroup.sh 
o "/org/freedesktop/systemd1/job/21373"
o "/org/freedesktop/systemd1/job/21377"
o "/org/freedesktop/systemd1/job/21381"
o "/org/freedesktop/systemd1/job/21385"
o "/org/freedesktop/systemd1/job/21389"

then it stops on its own and a sleep 3600 process stays active. If I don’t kill that process, on the next run I only get:

$ ./test-cgroup.sh 
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.
Call failed: Unit sleeping.scope already exists.

After how many lines does it stop? Actually, can you try this version of the script, which waits an extra 0.1 s after StartTransientUnit returns before checking the cgroup:

#! /bin/bash

set -e

while true
do
    sleep 3600 &
    sleep 0.1
    SLEEP_PID=$!
    busctl call --user org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
        sleeping.scope fail 1 PIDs au 1 $SLEEP_PID 0 \
        && sleep 0.1 \
        && grep -q sleeping /proc/$SLEEP_PID/cgroup
    kill $SLEEP_PID
done

This one should never stop. If it stops, it means that some assumptions we make in snapd are wrong, and we should increase the timeout for our operation.
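
For illustration, a more robust variant (just a sketch, not snapd’s actual code) would poll instead of relying on a single fixed delay:

# hypothetical: retry reading the cgroup for up to ~1 s
for _ in $(seq 1 10); do
    grep -q sleeping /proc/$SLEEP_PID/cgroup && break
    sleep 0.1
done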

On which machine are you seeing this problem? Just wondering if it could be especially slow.

Confirmed. This one never stops.

...
o "/org/freedesktop/systemd1/job/24385"
o "/org/freedesktop/systemd1/job/24389"
o "/org/freedesktop/systemd1/job/24393"
o "/org/freedesktop/systemd1/job/24397"
o "/org/freedesktop/systemd1/job/24401"
...
  • CPU: 12-core AMD Ryzen Threadripper 1920X
  • Memory: 62.65 GiB, used: 21.62 GiB (34.5%)
  • OS disks: 2 × 970 EVO 500 GB SSDs in RAID 1

Maybe then your machine is too fast :wink:

Can you please check, if the line

systemd could not associate process <PID> with transient scope snap.*.scope

is printed every time you start a snap, or only sometimes? I’m a bit puzzled, because the probability of you seeing this line should (unless we have some error in our code, of course) be about the same as that of the script failing.
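
If you saved the debug output of your runs, a quick way to count is (snap-run.log is just a placeholder for wherever you captured it):

grep -c 'could not associate process' snap-run.log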

From 10 tries, I got it 10 times. I tried starting hello-world with nice -n 19 and ionice -c 3, but the result is the same :frowning: