While running our tests on bionic/i386 (in the autopkgtest VM) with kernel 4.13.0-32-generic I ran into the issue that the interfaces-many
test lost the spread connection. Looking into this it turns out that this test triggers an out-of-memory condition and the kernel OOM killer goes wild killing the spread connection.
This can be reproduced with the following script: http://paste.ubuntu.com/p/26xzSJ8NHJ/ (it needs the test-snapd-policy-app-provider-classic and the test-snapd-policy-app-consumer snaps from the snapd git tree).
It not entirely straightforward though, it appears that the issue is not that snapd itself grows out of bounds. The kernel log (via syslog) show that the oom killer is invoked because the system runs out of ânormalâ memory (note how âfreeâ was failing below âminâ):
Feb 20 09:00:02 autopkgtest kernel: [ 952.770375] Normal free:20552kB min:20644kB low:25804kB high:30964kB active_anon:0kB inactive_anon:0kB active_file:12kB inactive_file:20kB unevictable:0kB writepending:0kB present:493560kB managed:463576kB mlocked:0kB kernel_stack:728kB pagetables:0kB bounce:0kB free_pcp:612kB local_pcp:612kB free_cma:0kB
A wdiff of /proc/meminfo
from the first and the last run confirms this (note how LowFree goes from 394Mb to 27Mb.
MemFree: [-1174968-] {+848776+} kB
MemAvailable: [-1181104-] {+787788+} kB
Buffers: [-19396-] {+188+} kB
Cached: [-113592-] {+62800+} kB
Active: [-93976-] {+51644+} kB
Inactive: [-63188-] {+45652+} kB
Active(anon): [-24500-] {+34652+} kB
Inactive(anon): [-4012-] {+14532+} kB
Active(file): [-69476-] {+16992+} kB
Inactive(file): [-59176-] {+31120+} kB
HighTotal: 923528 kB
HighFree: [-780928-] {+821288+} kB
LowTotal: 479492 kB
LowFree: [-394040-] {+27488+} kB
Dirty: [-64-] {+12+} kB
AnonPages: [-24192-] {+34312+} kB
Mapped: [-29508-] {+29408+} kB
Shmem: [-4328-] {+14872+} kB
Slab: [-24344-] {+101512+} kB
SReclaimable: [-11932-] {+9152+} kB
SUnreclaim: [-12412-] {+92360+} kB
KernelStack: [-704-] {+728+} kB
PageTables: [-952-] {+1016+} kB
CommitLimit: 701508 kB
Committed_AS: [-150652-] {+188328+} kB
VmallocTotal: 524288 kB
AnonHugePages: [-0-] {+2048+} kB
Hugepagesize: 2048 kB
DirectMap4k: 10232 kB
DirectMap2M: 499712 kB
It seems to be not related to snapd itself, the top log shows only a moderate increase (first and last entry in the log before the oom happend).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1089 root 20 0 871592 17428 12128 S 0.0 1.2 0:12.67 snapd
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1089 root 20 0 871592 23768 12128 S 0.0 1.7 0:14.72 snapd
Iâm a bit at a loss (looking at meminfo, slabtop and top) where this memory goes. It looks almost like a memory leak in the kernel. It seems that each âconnect/disconnectâ looses about ~15Mb of LowFree memory.
Another interesting data point is that when I replace apparmor_parser with /bin/true I can no longer trigger the oom condition (but downgrading apparmor/libapparmor1 to the artful version had no effect).
The next step should probably be trying this script on more systems to see if there is a pattern (different kernel, different apparmor, different go, different udev etc).
I put the test script to http://people.canonical.com/~mvo/tmp/oom-test.tar.gz - unpacking and running ./test.sh in a VM should be enough. Depending on the memory configuration the test.sh may not trigger the oom right away but the decrease of free memory should be easily observable.
Couldnât reproduce it with kernel 4.10 (zesty) or lower.