OOM for interfaces-many on bionic/i386

While running our tests on bionic/i386 (in the autopkgtest VM) with kernel 4.13.0-32-generic I ran into the issue that the interfaces-many test lost the spread connection. Looking into this, it turns out that this test triggers an out-of-memory condition and the kernel OOM killer goes wild, killing the spread connection.

This can be reproduced with the following script: http://paste.ubuntu.com/p/26xzSJ8NHJ/ (it needs the test-snapd-policy-app-provider-classic and the test-snapd-policy-app-consumer snaps from the snapd git tree).

It’s not entirely straightforward though: it appears that the issue is not that snapd itself grows out of bounds. The kernel log (via syslog) shows that the oom killer is invoked because the system runs out of “Normal” zone memory (note how “free” has fallen below “min”):

Feb 20 09:00:02 autopkgtest kernel: [  952.770375] Normal free:20552kB min:20644kB low:25804kB high:30964kB active_anon:0kB inactive_anon:0kB active_file:12kB inactive_file:20kB unevictable:0kB writepending:0kB present:493560kB managed:463576kB mlocked:0kB kernel_stack:728kB pagetables:0kB bounce:0kB free_pcp:612kB local_pcp:612kB free_cma:0kB

A wdiff of /proc/meminfo from the first and the last run confirms this (note how LowFree goes from roughly 394MB down to 27MB):

MemFree:         [-1174968-]          {+848776+} kB
MemAvailable:    [-1181104-]     {+787788+} kB
Buffers:           [-19396-]             {+188+} kB
Cached:           [-113592-]            {+62800+} kB
Active:            [-93976-]            {+51644+} kB
Inactive:          [-63188-]          {+45652+} kB
Active(anon):      [-24500-]      {+34652+} kB
Inactive(anon):     [-4012-]    {+14532+} kB
Active(file):      [-69476-]      {+16992+} kB
Inactive(file):    [-59176-]    {+31120+} kB
HighTotal:        923528 kB
HighFree:         [-780928-]         {+821288+} kB
LowTotal:         479492 kB
LowFree:          [-394040-]           {+27488+} kB
Dirty:                [-64-]                {+12+} kB
AnonPages:         [-24192-]         {+34312+} kB
Mapped:            [-29508-]            {+29408+} kB
Shmem:              [-4328-]             {+14872+} kB
Slab:              [-24344-]             {+101512+} kB
SReclaimable:      [-11932-]       {+9152+} kB
SUnreclaim:        [-12412-]        {+92360+} kB
KernelStack:         [-704-]         {+728+} kB
PageTables:          [-952-]         {+1016+} kB
CommitLimit:      701508 kB
Committed_AS:     [-150652-]     {+188328+} kB
VmallocTotal:     524288 kB
AnonHugePages:         [-0-]      {+2048+} kB
Hugepagesize:       2048 kB
DirectMap4k:       10232 kB
DirectMap2M:      499712 kB
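For reference, a comparison like the one above can be produced with GNU wdiff; the exact commands below are an assumption rather than what was literally run:

cp /proc/meminfo /tmp/meminfo.first     # before the first run of the script
# ... run the reproducer ...
cp /proc/meminfo /tmp/meminfo.last      # after the last run
wdiff /tmp/meminfo.first /tmp/meminfo.last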

It does not seem to be related to snapd itself; the top log shows only a moderate increase for the snapd process (first and last entry in the log before the oom happened):

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1089 root      20   0  871592  17428  12128 S  0.0  1.2   0:12.67 snapd

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1089 root      20   0  871592  23768  12128 S  0.0  1.7   0:14.72 snapd 

I’m a bit at a loss (looking at meminfo, slabtop and top) as to where this memory goes. It looks almost like a memory leak in the kernel. It seems that each “connect/disconnect” cycle loses about ~15MB of LowFree memory.
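That ~15MB figure can be double-checked with a small loop like the following (a sketch only; it uses the bluez plug of the consumer snap as an example, any connect/disconnect pair should do):

# measure the LowFree drop per connect/disconnect cycle
# (LowFree only exists on 32-bit highmem kernels, i.e. the i386 case here)
for n in $(seq 10); do
    before=$(awk '/^LowFree/ {print $2}' /proc/meminfo)
    sudo snap connect test-snapd-policy-app-consumer:bluez
    sudo snap disconnect test-snapd-policy-app-consumer:bluez
    after=$(awk '/^LowFree/ {print $2}' /proc/meminfo)
    echo "cycle $n: LowFree dropped by $((before - after)) kB"
done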

Another interesting data point: when I replace apparmor_parser with /bin/true I can no longer trigger the oom condition (downgrading apparmor/libapparmor1 to the artful version, on the other hand, had no effect).
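For anyone who wants to repeat that experiment, one clean way to swap the parser out (just a suggestion, not necessarily how it was done above, and assuming the parser lives at /sbin/apparmor_parser as on Ubuntu) is via dpkg-divert:

# move the real parser aside and point apparmor_parser at /bin/true
sudo dpkg-divert --local --rename --add /sbin/apparmor_parser
sudo ln -s /bin/true /sbin/apparmor_parser
# ... rerun the reproducer, then undo the diversion:
sudo rm /sbin/apparmor_parser
sudo dpkg-divert --local --rename --remove /sbin/apparmor_parser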

The next step should probably be trying this script on more systems to see if there is a pattern (different kernel, different apparmor, different go, different udev etc).

I put the test script at http://people.canonical.com/~mvo/tmp/oom-test.tar.gz - unpacking it and running ./test.sh in a VM should be enough. Depending on the memory configuration, test.sh may not trigger the oom right away, but the decrease in free memory should be easily observable.

Couldn’t reproduce it with kernel 4.10 (zesty) or lower.

Interesting. Note to others: snapd runs apparmor_parser on each changed profile in /var/lib/snapd/apparmor/profiles separately. Therefore, we should be able to create a simpler reproducer by:

  1. boot into a xenial system (ie, no oom)
  2. install the snaps
  3. copy /var/lib/snapd/apparmor/profiles/* into /tmp/bug/orig
  4. make the interface connections
  5. copy /var/lib/snapd/apparmor/profiles/* into /tmp/bug/new
  6. tar -zcvf /tmp/bug.tar.gz /tmp/bug

Now copy this tarball to an affected kernel and run apparmor_parser -r on each profile in bug/orig and bug/new in succession in a loop to trigger the issue. I’ll do this and attach the result to this topic.

UPDATE: note that doing this means that the squashfs mounts for the snaps (core and the two from the reproducer) are no longer present.

@mvo - how much memory does the VM have?

From IRC:

08:12 < mvo> jdstrand: the vm has 1500mb
08:12 < mvo> jdstrand: but it seems like the LowMem is what triggers the oom on 
             i386 which is just 400mb
08:12 < jdstrand> mvo: this is i386 only/
08:13 < jdstrand> ?
08:13 < mvo> jdstrand: I suspect its amd64 too, it just takes a lot longer as 
             it has the full 1500mb of memory
08:13 < mvo> jdstrand: for some reason its the lowmem on i386 that runs out

Here are just the apparmor profiles: http://people.canonical.com/~jamie/aa/bug.tar.gz

I can trigger the OOM without snapd or any snaps installed after a while (a few minutes) in an i386 VM with 768MB by doing:

$ wget http://people.canonical.com/~jamie/aa/bug.tar.gz
$ tar -zxvf ./bug.tar.gz
$ while /bin/true ; do for i in bug/orig/* bug/new/* ; do sudo apparmor_parser --write-cache -O no-expr-simplify --cache-loc=/var/cache/apparmor --quiet  -r $i ; done ; done

I’ve reported this here: https://bugs.launchpad.net/apparmor/+bug/1750594


Also from IRC:

08:26 < mvo> jdstrand: fwiw, it looks like amd64 is also affect, just takes 
             longer because of the different memory available (1480 vs 390)

gobble gobble nom nom all the rams :-p

From IRC:

07:06 <@jdstrand> mvo: hey, so jjohansen looked at the memory issue quite a bit 
                  yesterday. the summary is that the profiles themselves are 
                  only taking no more than 30M of actual kernel memory. there 
                  is a small leak (*much* smaller than 30M; on the order of 2M 
                  after processing all the profiles in /var (if I'm reading 
                  jj's numbers right))
07:07 <@jdstrand> mvo: so it seems that the system was already under memory 
                  pressure, and that straw caused it to oom, but it was simply 
                  the straw that broke the camel's back. there were many more 
                  straws before it
07:08 <@jdstrand> mvo: jjohansen said he'll contine searching for that small 
                  leak. he's also investigating reducing that 30M by quite a bit

Fwiw, this error is now showing up in the bionic autopkgtest on i386 even with the interfaces-many test disabled. This currently blocks 2.32 from entering bionic.

I will shortly attach a small tarball/zip that reproduces the issue very quickly. Looking at the interface code, the leak could also be in the udev-related parts, as many interfaces cause a udev “cycle” (trigger/settle). In my analysis I will look at udev and apparmor separately, trying to work out which side of the problem is the more “costly” one.
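To look at the udev side in isolation, something along these lines should exercise just the trigger/settle cycle mentioned above, without apparmor being involved at all (a sketch):

# hammer the udev trigger/settle cycle and watch low memory as before
for n in $(seq 1000); do
    sudo udevadm trigger
    sudo udevadm settle
done
grep LowFree /proc/meminfo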

I’m looking at this issue now and, surprise, it seems that on mainline master (close to e241e3f2bf97) we are not leaking memory (or if so, far, far less). I will post some quantitative data later (start: 1.96GB memory used; after running for about 30 minutes: 2.17GB).

Also, a super interesting fact is that my test loop:

sudo sh -c 'for i in $(seq 10000); do snap connect test-snapd-policy-app-consumer:bluez && snap disconnect test-snapd-policy-app-consumer:bluez; done'

is actually not changing the apparmor profiles. I took a before/after snapshot and, to my astonishment, noticed that they are identical (all profiles are generated by snapd). Since apparmor_parser is not loading anything new into the kernel in this situation (AFAIK, I need to confirm this), this may indicate that the leak is triggered merely by re-loading profiles that are already present.
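For the record, the before/after comparison was essentially this (reconstructed here, using the profile directory mentioned earlier):

sudo cp -a /var/lib/snapd/apparmor/profiles /tmp/profiles.before
# ... run the connect/disconnect loop above ...
sudo cp -a /var/lib/snapd/apparmor/profiles /tmp/profiles.after
diff -ru /tmp/profiles.before /tmp/profiles.after    # empty output: the profiles are identical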

EDIT: due to the use of partial confinement on mainline kernels, the relevant (DBus) apparmor rules were not generated, so my test was inconclusive. I will now focus on a loop of just apparmor_parser with the disconnected/connected profile pair.

We analyzed this again and my position now is that these previous ‘straws’ may have to do with apparmor_parser replaces prior to this test. In other words, it seems possible that, with hundreds or thousands of parser replace/unload operations (plausible with the number of spread tests we have), memory would leak to the point where the interfaces-many test pushes it over the edge.

I suggest that for the time being the test be disabled on i386 (most of its value is not architecture-dependent, so running it on, say, amd64 is sufficient for its goals). The fact that it caught the leak is good though, and jjohansen will take a look at this, but a fix won’t be in time for snapd on bionic (it will happen upstream/in an SRU).

The fact that this problem was there, then went away for a while, then came back may have to do with the per-snap snap-update-ns profiles, which increase the number of replace/unload operations in the tests.

I just performed another interesting test.

On the Ubuntu artful kernel (4.13.16) simply loading a pre-compiled binary profile like this:

sudo sh -c 'for i in $(seq 10000); do apparmor_parser --replace --binary connected.bin; done'

leaks a significant amount of memory. What is interesting is this syslog message:

kwi 12 14:12:46 fyke audit[24211]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.test-snapd-policy-app-consumer.bluez" pid=24211 comm="apparmor_parser"

This seems to imply that a very specific code path (replacing an unchanged profile) is leaking memory. Since snapd reloads all profiles for a given snap, even if unchanged, one way we could “save” some memory is to change interfaces/apparmor/backend.go so that it only loads the changed profiles.

I will investigate the kernel code to see if there’s anything obviously leaky there.

EDIT: Some rough numbers: I forgot to record the exact base memory use of my machine (it was roughly 1.8GB), but after 10K iterations of “load an unchanged profile” the memory usage had jumped to 2.7GB.

The strace of a single apparmor_parser invocation as shown above is: http://paste.ubuntu.com/p/gXSjF2HPrF/
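For completeness, connected.bin here is a binary profile pre-compiled from the snapd-generated text profile; producing it looks roughly like this (the exact invocation is an assumption and not shown in the original run):

# compile the profile to a binary blob without loading it into the kernel
sudo apparmor_parser --skip-kernel-load -o connected.bin \
    /var/lib/snapd/apparmor/profiles/snap.test-snapd-policy-app-consumer.bluez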


I have proposed an RFC PR that will allow us to check if the workaround has any effect: https://github.com/snapcore/snapd/pull/5041

I ran your PR with qemu:ubuntu-18.04-32 but it eventually also OOMed :frowning: (in test 153/224).

I’m testing each kernel in Ubuntu from 4.4 all the way up to 4.15. Early numbers indicate that 4.4 doesn’t leak memory over the course of 10M profile replacements. 4.10 (I skipped 4.8 for now) also looks stable, but I have not finished the full cycle. This strongly points at 4.13 as the kernel where inserting a profile really starts to leak memory. I will provide more data after my testing run is complete.

I did way more tests and digging and I can confirm that the leak is only present when we are loading a policy that is already present in the kernel.

I can propose two workarounds:

We can generate a random garbage rule that makes each profile different; for example, we could allow read access to /.bug/bug-number/$RANDOM. Anything that doesn’t get optimised away, e.g. into a rule like /.bug/*, is good. Loading such a profile in a tight loop doesn’t affect the memory used by the kernel. If we do this, however, we nullify the use of apparmor caching.
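A rough sketch of what that would look like (bash, purely illustrative; in practice the extra rule would be emitted by snapd itself):

# reload every snapd profile with a unique throwaway rule appended, so the
# kernel never sees an identical replacement (this deliberately defeats caching)
for p in /var/lib/snapd/apparmor/profiles/snap.*; do
    tmp=$(mktemp)
    sed "s|^}\$|  /.bug/bug-number/$RANDOM r,\n}|" "$p" > "$tmp"
    sudo apparmor_parser -r "$tmp"
    rm -f "$tmp"
done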

We can inspect the profile we are replacing: compare the raw_hash of the policy already loaded in the kernel with the hash of the binary profile created by apparmor_parser --skip-kernel-load, and only load the profile if the hashes differ. This is more complex but doesn’t break caching.
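A sketch of that idea for a single profile (assumptions that would need verifying: the kernel exposes the loaded blob’s hash as raw_hash under securityfs, and that hash is a sha1 of the blob the parser produces):

profile=snap.test-snapd-policy-app-consumer.bluez
src="/var/lib/snapd/apparmor/profiles/$profile"
# compile the candidate without loading it, then hash the result
sudo apparmor_parser --skip-kernel-load -o /tmp/candidate.bin "$src"
new=$(sha1sum /tmp/candidate.bin | cut -d' ' -f1)
# only replace when the blob actually changed
sudo sh -c "grep -q $new /sys/kernel/security/apparmor/policy/profiles/$profile.*/raw_hash" \
    || sudo apparmor_parser -r "$src"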

Please don’t do this. Let’s just fix the kernel instead and work around it in the testsuite for now. jjohansen is working on a kernel fix with priority, which can hopefully be in the first kernel SRU.

The bug is now fixed by jj with this patch:

On Monday I will discuss including it in the release kernel with leann.