Memory (/run in-memory filesystem) leaking with snap install/remove

I asked this before on the snapcraft mailing list in February, but got no response, so sorry for the duplicate.

I’m testing snaps I’m building in a CI system, and as part of the test setup I’m installing/removing snaps on a frequent basis.
I’ve noticed that the /run filesystem (which is in memory and 200MB on my VMs) is filling up. In my case, I end up with approx. 45,000 small files over a two-week period, and snap install/remove fails afterwards because the /run filesystem is full. The only workaround so far seems to be rebooting the systems every few days.

This is consistent on snapd 2.21, 2.22.6, 2.23, 2.24.1 (latest) and many earlier versions.
This is on Ubuntu 16.04 (x86_64)

The files filling up the filesystem are in /run/udev/data. About 90% of them have filenames starting with +cgroup:
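For reference, this is roughly how I count them (the grep pattern just matches the filename prefix above; exact numbers vary per VM):

# count the leaked cgroup entries in the udev database
ls /run/udev/data | grep -c '^+cgroup'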

Is anyone else seeing a similar issue? Any ideas on what’s going on?

this is simply the default udev database; there should be one file per device on your system, like on every linux install that uses systemd-udevd, and none of them should be bigger than 4K. are you sure it is that dir that actually eats your ram?
what does “du -hcs /run/udev/data/*” show as occupied space? (i find it very unlikely that this db can actually fill 200MB, even if you have a ton of devices)

I rebooted them yesterday.

One of the servers now has /run at 69% (of 200MB) filled up; du -hcs /run/udev/data/* reports 115M total at the end of its listing.

Current /run filesystem is:

Filesystem     1K-blocks    Used Available Use% Mounted on
tmpfs             204796  140408     64388  69% /run

So, you may find it unlikely, but it is growing… (the usage of /run starts at 5–6% after a clean reboot)
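In case it’s useful, here’s the quick loop I use to watch the growth (the interval and log path are arbitrary):

# append a timestamped /run usage line once an hour
while true; do date; df /run | tail -1; sleep 3600; done >> /root/run-usage.log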

what kind of VM setup is that (docker, lxd, vmware, vbox, simple qemu kvm)?
what are the VM images you use (ubuntu-server, cloud, some self-rolled ones)?
is the machine up to date (as well as the VM images)?
any recurring log entries in dmesg, syslog, or journalctl on either host or guest?
is the host machine running a default ubuntu kernel?

as a temporary workaround you can flush the db regularly using “sudo udevadm info --cleanup-db”
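(if you need to automate that until the cause is found, a root cron entry roughly like this should do — the file name and interval here are just examples:)

# /etc/cron.d/flush-udev-db — flush the udev database every 6 hours
0 */6 * * * root /bin/udevadm info --cleanup-db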

i’d suggest filing a bug where you can submit the logs and system info, and linking it here …

The hypervisor system is KVM (QEMU) on CentOS 7.
The VMs are standard Ubuntu Server 16.04 (amd64).
Everything is up to date (host & VMs).
The VMs are configured as simple hosts with 1 core, 2GB RAM, and an 8GB disk.

Regarding the kernel: I’m not running the default 4.4 kernel, but the 4.8 kernel from the official Ubuntu repo.
Currently it’s 4.8.0-42-generic
(I can’t run 4.4 as I need MPLS networking)

root@ci-comp11-dut:~#  snap --version
snap    2.24.1
snapd   2.24.1
series  16
ubuntu  16.04
kernel  4.8.0-42-generic

I see some messages in syslog (frr is the snap I’m testing):

[...]
Apr 28 09:36:47 ci-comp11-dut frr.vtysh[10289]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.23.6 2.23.6]): older than "/usr/bin/snap" (2.24.1)
Apr 28 09:36:47 ci-comp11-dut frr.vtysh[10303]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.23.6 2.23.6]): older than "/usr/bin/snap" (2.24.1)
Apr 28 09:36:48 ci-comp11-dut systemd[1]: Started Session 1815 of user root.
Apr 28 09:36:48 ci-comp11-dut frr.vtysh[10370]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.23.6 2.23.6]): older than "/usr/bin/snap" (2.24.1)
Apr 28 09:36:48 ci-comp11-dut systemd[1]: Started Session 1816 of user root.
Apr 28 09:36:48 ci-comp11-dut frr.vtysh[10438]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.23.6 2.23.6]): older than "/usr/bin/snap" (2.24.1)
Apr 28 09:36:50 ci-comp11-dut frr.vtysh[10458]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.23.6 2.23.6]): older than "/usr/bin/snap" (2.24.1)
Apr 28 09:36:50 ci-comp11-dut systemd[1]: Started Session 1817 of user root.
Apr 28 09:36:50 ci-comp11-dut frr.vtysh[10526]: cmd.go:114: DEBUG: not restarting into "/snap/core/current/usr/bin/snap" ([VERSION=2.23.6 2.23.6]): older than "/usr/bin/snap" (2.24.1)
[...]

The “udevadm info --cleanup-db” command seems to work and cleans up the directory.
(Most of the VMs are back to 100% full on /run by now)

Where do you want me to open the bug?

  • Martin

… is the place to file a bug …

if you try a VM with the default 4.4 kernel (linux-generic), does it happen too ?
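(roughly, to get the stock kernel back for a test boot — note grub may still default to 4.8 until you pick the 4.4 entry or remove the newer kernel:)

sudo apt update
sudo apt install linux-generic   # the default 4.4 kernel on 16.04
sudo reboot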

If the problem is with snapd itself, there’s no need to file a bug since we have pretty good information above, thanks @mwinter!

@ogra Do we need any other details or do we know what to do?

well, i don’t see that leaking of +cgroup entries from snaps anywhere, on either desktop installs or UbuntuCore images; we need to find out where they come from. i wonder if there is some custom udev rule at play here or something like that.

@ogra Do you need any more information from @mwinter or do we have enough to dig down a bit?

we obviously need to be able to reproduce this to dig in ourselves, and we can’t … i’d prefer a bug for now, and collecting all possible logs from @mwinter as a start.

my suspicion goes towards the centos qemu version (or some issue with the host kernel in combination with it)

We already have some logs above, and we have good information on which command solved the problem, clearly indicating this is a udev issue. Let’s please try to close in on the issue here instead of just moving the conversation elsewhere and splitting the thread too early.

the command “solving” the issue is the equivalent of “rm -rf /run/udev/data/*” … i.e. completely deleting the udev database.

i highly doubt it has anything to do with udev at all; this is just fallout from something (not necessarily snapd) going wild and creating +cgroup entries in a loop.

it should really be moved into a bug with complete logs attached (syslog, journald dump, dmesg, kern.log, udev debug output), etc.
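(for completeness, something along these lines run on an affected VM would capture those:)

journalctl -b --no-pager > journal.txt
dmesg > dmesg.txt
cp /var/log/syslog /var/log/kern.log .
udevadm control --log-priority=debug   # enable udev debug logging, then reproduce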

a discussion in the forum won’t help much further, i agree.

@mwinter I’ve found a very similar report which indicates that a system update had fixed the issue. Are there any updates pending (please take note of which, so we know), and if so, can you please update and restart the related processes to see if it helps?

Thanks again.

I’m up to date on all packages. The only difference is that I’m running the alternative 4.8 kernel (from the Ubuntu repo) and not the default 4.4 kernel.

I’m trying to see if I can reproduce the issue with 4.4, but I’m not optimistic, as my CI tests need a kernel > 4.5 for most functions, so it would be a very reduced, special setup.

However, I can provide access to a system in this bad state if this helps.

Bug opened: https://bugs.launchpad.net/snapd/+bug/1687507

thanks a lot for the log there … this and the output you pasted above look like either a problem with systemd or the way your CI scripts use it …

note that many of the cgroup files in that list are not related to snaps at all (there are apt updates in there and the like). i’m closing the snapd task on the bug as invalid and will open a systemd one so someone from the foundations team can take over.
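(you can verify that yourself, assuming the originating cgroup shows up in the file names as it does for the snap ones:)

# count the leaked entries that do not mention snap at all
ls /run/udev/data | grep '^+cgroup' | grep -vc snap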