Issues to address ahead of Ubuntu 18.04 release

zyga-snapd · April 11, 2018, 8:34pm

Hey everyone!

We have a number of important issues that we need to focus on to have the best release-day experience possible. This is the moment to stop and think and find solutions. Features and other bugs should wait.

This post is a wiki now, please feel free to edit it to add and clarify individual issues.

The list of issues follows, numbered by relative priority.

#1 spread doesn’t build with golang 1.6 anymore (SOLVED)

This issue comes from spread now depending on the context package, which the Go 1.6 standard library lacks (it was added in Go 1.7). Without a working spread our autopkgtests fail, which makes Xenial SRUs impossible. We need to either build spread with a more recent version of Go (a backport) or, if possible, provide a branch of spread that works with 1.6. Remember that autopkgtests don’t need to fire off machines in Linode or Google, just control Qemu.

@mvo pushed a possible fix here: https://github.com/snapcore/spread/pull/56 and shortly thereafter @niemeyer merged it. This issue is now fixed.

#2 OOM on Bionic/i386 due to exhausted kernel lowmem. (SOLVED)

Tracked as: https://bugs.launchpad.net/apparmor/+bug/1750594

This issue manifests itself when autopkgtests run on a bionic i386 system. We have 1500MB of memory but run out relatively quickly during the interfaces-many test. We know of a kernel leak but we are unsure whether it fully accounts for the problem. More investigation is needed. Note that it also happens on an amd64 system, but there the lowmem limit is not as small and it just takes longer. It is possible that this is somehow related to the meltdown/spectre fixes, which may have changed the memory split, but we don’t know this for sure, so don’t assume it is true.

Dedicated thread: OOM for interfaces-many on bionic/i386
Scripts with commands to reproduce: http://paste.ubuntu.com/p/26xzSJ8NHJ/
@jdstrand has suggested that we disable the offending test, as the leak is not new and won’t be fixed across the stable kernels for some time. However, it’s not easy, as no single test is to blame here.

We now have a fix for the kernel as well as a workaround. We will probably not include the workaround as it has some downsides (caching is invalidated). The kernel can be released in the SRU on release day or we may even make it into final. ZK will discuss this with leann on Monday.

#3 Memory use on minimal/constrained systems

This is partially tracked by https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1730159 and partially reported by this image

Screenshot

We need to understand where the memory is being used and whether we could avoid running snapd at all unless necessary.

We discussed some stop-gap measures. One idea is to have a special wakeup unit for snapd (https://github.com/snapcore/snapd/pull/5038) that fires only when there are snaps installed or a seed file is present, and that wakes snapd up. By default snapd would be socket activated. The downside of this approach is that no snap data for command-not-found would be pulled into the system.
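
As an illustration only, the condition such a wakeup unit would test can be sketched as below. This is not the code from the pull request: the function name is hypothetical, and both paths are passed in as parameters because the real locations of the snap blobs and the seed file are assumptions here.

```python
from pathlib import Path


def should_wake_snapd(snaps_dir: Path, seed_file: Path) -> bool:
    """Return True when snapd has something to do and should be started.

    Hypothetical helper: the directory holding *.snap blobs and the
    seed file location are assumptions supplied by the caller.
    """
    # A seed file means first-boot seeding still has to happen.
    if seed_file.is_file():
        return True
    # Any *.snap blob means at least one snap is installed.
    return snaps_dir.is_dir() and any(snaps_dir.glob("*.snap"))
```

When this returns False the system would rely on socket activation alone, which is also why the command-not-found data would never be fetched in that state.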

#4 Reading meta/snap.yaml before snap is really mounted (WORKAROUND added in 2.32.5)

We don’t have much information (originally reported to @pedronis) but what we have seems to indicate that on an Ubuntu 16.04 (xenial) system (not inside a container, so not using FUSE) snapd thinks that something is already mounted while it is not really mounted in practice. We then read a “shallow” snap.Info with a Broken field set to the error value. We recently (2.32.3) patched the interface repository to reject such snaps loudly but that’s just one of many places. We don’t understand the root cause and have not yet tried to reproduce it.

A bug possibly related to this issue: https://bugs.launchpad.net/snappy/+bug/1616629

So far we have not managed to reproduce the bug, but we have added better error handling around the place where we think the issue occurs: https://github.com/snapcore/snapd/pull/5045
Even if the supposed not-mounted situation is persistent, or is unchanged after 10s, we will at least get a better error and better handling.
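
A minimal sketch of that kind of handling, assuming it amounts to polling for meta/snap.yaml for up to 10s and then failing with a descriptive error. This is an illustration, not the code from the pull request; the function name, the error type and the polling interval are all hypothetical.

```python
import time
from pathlib import Path


class SnapNotMountedError(Exception):
    """Hypothetical error for a snap whose mount never became visible."""


def wait_for_snap_yaml(mount_dir: Path, timeout: float = 10.0,
                       interval: float = 0.1) -> Path:
    """Poll until mount_dir/meta/snap.yaml exists or the timeout expires."""
    yaml_path = mount_dir / "meta" / "snap.yaml"
    deadline = time.monotonic() + timeout
    while True:
        if yaml_path.is_file():
            return yaml_path
        if time.monotonic() >= deadline:
            # Fail loudly instead of handing back a "shallow" snap info.
            raise SnapNotMountedError(
                f"{yaml_path} still missing after {timeout:g}s; "
                f"is {mount_dir} really mounted?"
            )
        time.sleep(interval)
```

The point of the design is that a persistent not-mounted state surfaces as one clear error at the place it is detected, rather than as a broken snap discovered later by some unrelated code path.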

#5 Refresh-mode signal affecting children of the main process (SOLVED)

We have one report, currently under investigation, about our recent addition of refresh-mode, which controls how snap processes are signaled when a refresh occurs. The report is that children of the main process are also being signaled, but so far we could not reproduce it in tests. We need to check whether this is really the case (perhaps a systemd issue, perhaps an outdated systemd) or whether it is a secondary application effect (SIGPIPE on a broken pipe or something similar).

ogra · April 12, 2018, 11:17am

regarding #4 …

just create /snap/<snapname>/<version>/meta/ on package install/refresh and copy snap.yaml there out of the squashfs …

then that info is available if you do not have /snap/<snapname>/ mounted … (and gets overlayed with the original file as soon as the mount is there)
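
A rough sketch of the suggestion above, assuming the copy is done by asking unsquashfs to extract only meta/snap.yaml into the not-yet-mounted directory. The helper names are hypothetical and this is not how snapd currently behaves.

```python
import subprocess
from pathlib import Path
from typing import List


def snap_yaml_extract_cmd(snap_file: Path, mount_dir: Path) -> List[str]:
    """Build an unsquashfs command that extracts only meta/snap.yaml.

    -n disables the progress bar, -f allows writing into an existing
    directory, -d selects the destination directory.
    """
    return ["unsquashfs", "-n", "-f", "-d", str(mount_dir),
            str(snap_file), "meta/snap.yaml"]


def preseed_snap_yaml(snap_file: Path, mount_dir: Path) -> None:
    """Hypothetical install/refresh hook: place snap.yaml under the mount point."""
    mount_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(snap_yaml_extract_cmd(snap_file, mount_dir), check=True)
```

Once the squashfs is mounted on the same directory, the extracted copy is hidden by the mounted file, as described in the reply.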