Out-of-disk-space protection

chipaca · August 10, 2017, 9:26am

I’d like to protect snapd and snappy systems from getting into trouble due to running out of disk space. In particular:

refuse a download if there’s not enough space for it
refuse any operation if there’s not enough space for state.

objections/ideas?

zyga-snapd · August 10, 2017, 9:28am

We could also consider things like:

Exposing “used disk” data per snap/revision in the API
Offering a knob where users can choose how many revisions to keep: 3, 2 or just 1

chipaca · August 10, 2017, 9:39am

we already expose the “cheap” one (how much space the snap uses), although only the current one is shown in snap info. The expensive one we could do by hand via snap space or similar, but we get into the old “enumerate users” problem.

zyga-snapd · August 10, 2017, 9:54am

Interestingly, I would say we want to show it for the current user (unprivileged) unless you ask with sudo, in which case snapd could just know. As for enumerating users, I think /root + /home/*/snap is a very rough but good approximation.
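To make the /root + /home/*/snap approximation concrete, here is a minimal sketch in Python. It is not how snapd does or would do it; the function names and the `prefix` parameter (there so it can be pointed at a test tree) are made up for illustration, and it assumes per-user data lives in a directory named after the snap under each user's snap directory.

```python
import glob
import os


def dir_size(path):
    """Sum the apparent sizes of files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += st.st_size
    return total


def user_snap_dirs(prefix="/"):
    """Rough approximation from the post: /root/snap plus /home/*/snap."""
    candidates = [os.path.join(prefix, "root", "snap")]
    candidates += sorted(glob.glob(os.path.join(prefix, "home", "*", "snap")))
    return [d for d in candidates if os.path.isdir(d)]


def user_data_usage(snap_name, prefix="/"):
    """Bytes used by snap_name's per-user data, keyed by user snap directory."""
    usage = {}
    for snap_dir in user_snap_dirs(prefix):
        data_dir = os.path.join(snap_dir, snap_name)
        if os.path.isdir(data_dir):
            usage[snap_dir] = dir_size(data_dir)
    return usage
```

Run unprivileged, this only sees what the calling user can read, which matches the "current user unless you use sudo" idea. Users whose home is not under /home are missed, which is why it is only an approximation.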

mvo · August 10, 2017, 10:56am

I think it is great to look at this. Of course there is often the problem of TOCTOU, i.e. we check that there is enough disk space, the download starts, and then some process starts filling up the disk. It would be interesting to double-check that we are correctly undoing operations if the space runs out for state.json. Having some reserved space here might be good.
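One common way to keep a state file safe when the disk fills up mid-write is to write to a temporary file and rename it into place. The sketch below shows that general technique in Python; it is an illustration under that assumption, not a description of what snapd actually does, and `write_state_atomically` is a made-up name.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Write state as JSON to path, leaving any existing file intact on failure."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force data to disk so a full disk is reported now
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except Exception:
        # Out of space or any other error: drop the partial file and let the
        # caller undo the operation; the previous state file is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

This does not remove the TOCTOU window, but a failed write then leaves the old state readable, so the caller has something consistent to roll back to.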

niemeyer · August 12, 2017, 5:34pm

As Michael suggests, instead of checking for space too strictly for what is being done, ideally it should check for some reasonable amount of operational space. We can start with something conservative… perhaps 100MB + expected download size? This won’t solve the problem completely as snaps can still fill up space after they start running, but it’s a start.
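As a rough illustration of the "100MB + expected download size" rule, here is a minimal Python sketch. The function names are hypothetical and the 100 MiB default is just the starting figure floated above, not a decided value.

```python
import shutil

MIB = 1024 * 1024
DEFAULT_RESERVE = 100 * MIB  # starting figure from the discussion, not a settled value


def has_room_for(download_size, free_bytes, reserve=DEFAULT_RESERVE):
    """True if free_bytes covers the download plus the operational reserve."""
    return free_bytes >= download_size + reserve


def can_download(path, download_size, reserve=DEFAULT_RESERVE):
    """Check the filesystem holding path before starting a download."""
    free = shutil.disk_usage(path).free
    return has_room_for(download_size, free, reserve)
```

For example, `has_room_for(50 * MIB, 149 * MIB)` is false because only 99 MiB would remain after the download. The check is a snapshot, so something else can still fill the disk after it passes.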

zyga-snapd · August 19, 2017, 8:30am

I agree with Gustavo, we should ensure that the system has a reasonable amount of free space left. I think we should be more conservative than that (~1GB is a good quantity) as this would allow us to keep updating core and perhaps other essential snaps (downloads/deltas).

chipaca · August 30, 2017, 3:31pm

Talking this over with @ogra (to make sure it was sane on devices), he points out that a good measure of how much space we should leave to hand is the amount of space needed for an essential snaps refresh (i.e. for core, kernel, gadget). That leads to ~250MB being reasonable on core (but better would be to determine exactly that size via looking at the current snaps, + some padding).

On classic, 100MB leaves ~30MB for random misc beyond a refresh of core, which seems fine.

In all cases we probably need this to be configurable. There will always be a scenario where the admin needs to shrink this in a hurry.
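The sizing idea above (space for a refresh of the essential snaps, plus some padding, with an admin override) could be sketched like this in Python. Everything here is hypothetical: the function name, the `override` parameter, and the 30 MiB padding, which is simply the "random misc" figure mentioned for classic.

```python
MIB = 1024 * 1024


def operational_reserve(essential_sizes, padding=30 * MIB, override=None):
    """Bytes to keep free: an admin override if set, else essentials + padding.

    essential_sizes maps snap name -> size in bytes of the currently
    installed revision (e.g. core, kernel, gadget).
    """
    if override is not None:
        return override  # the admin's configured value always wins
    return sum(essential_sizes.values()) + padding
```

With only a core snap of about 70 MiB, as the classic figures above imply, `operational_reserve({"core": 70 * MIB})` gives 100 MiB. On a core device the kernel and gadget sizes would be added in the same way.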

Also, in talking of ~30MB of space for growing, @pedronis rightly points out that this means that UpdateMany will have to have some of this logic (and the ability to do a selective refresh, as space permits).

Also that even a simple refresh will need to check what it’s trying to refresh before saying no based on disk space.
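One possible reading of these two points, sketched in Python: a multi-snap refresh picks what fits, and essential snaps are allowed to dip into the reserve since that is what the reserve is for. This is an interpretation for illustration only; `select_refreshes` and its tuple format are invented here and are not the UpdateMany API.

```python
def select_refreshes(candidates, free_bytes, reserve):
    """Pick refreshes that fit, in the order given.

    candidates is a list of (name, download_size, essential) tuples.
    Essential snaps may use the reserved space; others must leave it alone.
    Returns (selected_names, skipped_names).
    """
    used = 0
    selected, skipped = [], []
    for name, size, essential in candidates:
        limit = free_bytes - used
        if not essential:
            limit -= reserve  # non-essential snaps cannot touch the reserve
        if size <= limit:
            selected.append(name)
            used += size
        else:
            skipped.append(name)
    return selected, skipped
```

With 200 units free and a reserve of 100, a 150-unit core refresh is selected while a 60-unit application refresh is skipped, rather than the whole operation being refused.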

Lots of fun!

jhodapp · March 16, 2018, 6:25pm

@chipaca, @zyga-snapd, @mvo, @niemeyer

Can someone give an update on the latest status of adding the disk size checks to snapd? I’m asking on behalf of a customer who is running into a very critical issue where they’re not able to get an IP address on their network interface after a soft reboot when their eMMC part is full. So this will happen any time the device reboots for any reason, including for snap refreshes (which will also fail due to the partition being full).

If no work has begun on this, can someone comment on whether this has made the priority list?

chipaca · March 16, 2018, 7:57pm

I don’t think it’s been scheduled, but I’d gladly work on it next week. At least a first iteration of it.

What is the customer needing?

jhodapp · March 20, 2018, 8:54pm

Thanks. The customer basically needs the snapd system to keep working even when there’s no disk space. If that’s not possible, it needs to protect itself before it runs out of disk space. Apparently snapd doesn’t start any daemons when it runs out of disk space, so a snapped version of Network Manager fails to be started which means no network interfaces are brought up. For an IoT device, this is a critical issue.

In the meantime, we’ll also be suggesting to this customer to disable rsyslog (if this is possible for their uses) so that their disk doesn’t fill up so quickly.

Do we have any consensus about how we want to handle these situations?

chipaca · March 20, 2018, 11:13pm

Do we know what’s filling up their disk?

I ask because if it isn’t snapd itself filling the disk there isn’t a lot we can do, is there?

jhodapp · March 21, 2018, 7:17pm

I don’t know the full details yet, but it sounds like it’s mostly the syslog. Ideally, though, snapd should still be able to start applications/daemons like Network Manager even when out of disk space.

Are there current design realities that prevent this from ever being possible?

chipaca · March 21, 2018, 7:23pm

snapd doesn’t start the daemons, that’s systemd’s job. Having said that, I’m not surprised that network manager can’t start when out of space (systemd might even be telling it to start). I wouldn’t be shocked to learn that dbus also fails to start when out of disk space.

The out-of-disk-space protection is about snapd not eating all of the space; what would the customer do (or what would we do for them) in a non-snappy system? Because it sounds like their problem is not related to snapd.

jhodapp · March 21, 2018, 7:26pm

Those are great points, and I guess what we’re talking about in this case is more an Ubuntu Core system in general and how it handles out-of-disk-space conditions. As you say, there are certain things that snapd should/shouldn’t do when out of space, but there are also orthogonal issues that are unrelated to snapd specifically while still being part of an Ubuntu Core system more generally.

We should start by listing some use cases that we want to help protect when the system is very low on disk space or has completely run out.

svet · March 22, 2018, 7:01am

Just adding my 2¢:

On more recent systems, unless you’ve changed the default behavior, journald shouldn’t actually be logging to disk: see Change in logging behaviour on Ubuntu Core.

At https://gist.github.com/JPvRiel/b7c185833da32631fa6ce65b40836887 there’s a nice overview of how to enable/disable persistent journal storage. As that article says, even with persistent storage enabled there should be “sane” defaults for SystemMaxUse and SystemKeepFree of 10% and 15% respectively. In my experience these settings work as expected in practice.

Even more broadly than Ubuntu Core, I’d say that this is a Linux system question: how do you ensure that there is always enough space left for “really critical” services to run, when other “maybe also critical” processes are also competing for that space? If anyone has come across effective solutions to this issue, I’d love to hear them - though maybe that’s a discussion for a separate thread.

Measures such as reserved filesystem space, or user quotas, are probably not adequate on a system where many of these processes are running as root anyway (but maybe some segregation is still possible?). In days past, some adequate segregation may have been achieved through disk partitioning, but that hardly seems like the right solution on contemporary systems.

Either way, ensuring that individual processes like snapd are conscious of their own space use is a good step forward!

ejfinneran · August 8, 2019, 7:37pm

It sounds like a lot of the focus on full disk protection is on ensuring that snapd can start/run, but other critical services need attention too.

For example, if network-manager can’t run on headless systems (i.e. IoT gateways) the unit is effectively bricked. These systems are often mounted to the ceiling and networking is the only way to tell snapd what to do.

Ubuntu Classic systems don’t seem to have this problem. mkfs leaves 5% of the blocks on a disk reserved for root so even if a user space app fills the disk, critical services that run as root can still function. In UC everything runs as root so those protections are effectively bypassed.
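To see how much of that root-only reserve a given filesystem currently has, one can compare the two free-block counts that statvfs reports. A minimal Python sketch (the function names are made up for illustration):

```python
import os


def reserved_bytes(bfree, bavail, frsize):
    """Bytes that are free but only usable by privileged processes."""
    return (bfree - bavail) * frsize


def reserved_for_root(path="/"):
    """Free space on path's filesystem that unprivileged users cannot use."""
    st = os.statvfs(path)
    # f_bfree counts all free blocks; f_bavail counts those available to
    # unprivileged users. The difference is what is held back for root.
    return reserved_bytes(st.f_bfree, st.f_bavail, st.f_frsize)
```

On an ext4 filesystem created with default options this is typically non-zero until the disk is nearly full. A process running as root draws on `f_bfree`, which is why the reserve offers no protection when everything runs as root.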