5 take-aways from 5 years of Snaps for IoT

In late 2017 I started playing around with Ubuntu Core, and soon after ended up deploying snaps to a fleet of IoT devices. Over the past 5 years, at my start-up (AMMP), we’ve used snaps to deploy software to several hundred dataloggers monitoring renewable energy systems in remote locations. I’ve had many reflections on the experience over this period, and I thought it’s time to share some of the main ones with the community.

My one-sentence takeaway is that I don’t think snaps are set up to be a serious contender in the IoT space. There aren’t many issues that make snaps fundamentally unfit for IoT (though there are some). What primarily worries me is that the direction of development - both past and present - appears to be geared towards improving the snap experience on the desktop. Truly cracking the desktop is a major undertaking in itself. And I believe that it will be fundamentally difficult to have a single package architecture that caters well to both desktop and IoT needs. The trade-offs are just too stark. I elaborate on this in the points below.

First, a couple of caveats:

Caveat 1: After trying to work with Ubuntu Core at the start, we did not use it in real-world deployments. Instead we installed snaps on top of other Debian-based OSes. This was for two reasons:

  1. Due to functionality requirements, we use hardware from vendors like UniPi, Sfera Labs, and Moxa. Even though most of the units are Raspberry Pi-based, their peripherals are not natively supported by Ubuntu Core - and after some initial attempts, we determined that we don’t have the resources to do the relevant hardware enablement. It was more straightforward to use the vendor-supplied software images and device drivers, and install snaps on top.
  2. We never quite built up the confidence that Ubuntu Core was sufficiently robust to warrant going all-in on it. The way it restricts what can be done with a unit (and various stories about bricked deployments on this forum) also gave us pause.

So I can’t rule out that using Ubuntu Core would have resulted in a much better experience - though based on my understanding of how the ecosystem fits together, I feel that most of my points would still apply.

Caveat 2: So far we’ve only used free (as in beer) services, and not paid for professional development or support from Canonical, or a brand store, etc. It’s quite possible that companies that use paid services get a wholly different experience - though I’ve tried to focus my points on the underlying software and architecture itself.

Now that I’ve got this out of the way, here are my takeaways, both positive and negative:

1. The developer and distribution experience is quite good

Reading this forum, I see a lot of “mixed reviews” regarding the snap development experience - e.g. this recent thread. Despite a few hitches, our experience of packaging our software as snaps has been quite good.

Doing the initial snap packaging - and maintenance since then - has not been too onerous; in no small part as a result of some very helpful individuals (including the Canonical team) on this forum. There are the occasional glitches, and half-documented features, but these don’t usually lead to critical issues. The fact that there is a free service that hooks into GitHub and automatically packages and distributes snaps for multiple architectures has been quite pleasing!
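To give a sense of scale: for a service like ours, the packaging largely comes down to a fairly short snapcraft.yaml. The sketch below is illustrative rather than our actual file - the snap name, base, and the python plugin are assumptions:

    name: ammp-datalogger        # illustrative name, not our real snap
    base: core22
    version: git
    summary: Example datalogging service
    description: |
      Minimal sketch of a snapcraft.yaml for a Python-based daemon.
    grade: stable
    confinement: strict

    parts:
      datalogger:
        plugin: python
        source: .

    apps:
      datalogger:
        command: bin/datalogger
        daemon: simple
        plugs:
          - network
          - network-bind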

Finally, I know that the Canonical-run closed-source snap store can divide opinion. I would say it actually works quite well for software distribution in an IoT context. I suspect that the (paid-for) brand store is also a nice upgrade for those that need more fine-grained control.

2. The sandbox mechanism is pointless, and it’s a pain

I find it unfortunate that strict confinement plays such a major role in Ubuntu Core’s marketing material. I’m happy to be corrected, but I fail to see how it meaningfully improves security in an IoT context.

Specifically, I don’t believe that sandboxing mitigates the kinds of security vulnerabilities that are prevalent in IoT. From what I’ve seen, the majority of exploits involve weak application security, which allows an attacker to use a pwned IoT device to infiltrate the local network. In other words an insecure IoT device is rarely the target - it’s simply a jumping-off point for a broader attack. Application sandboxing generally cannot help prevent this.

And in cases where the IoT device - such as a security camera - is the target, sandboxing also does nothing useful. The software on an IoT device fundamentally needs to have access to its interfaces (camera, network) and application data; so an exploit like RCE will always get the attacker access to those, regardless of sandboxing.

The sandbox paradigm is primarily relevant when running untrusted applications on a desktop. It does little in an IoT context.

Yet it is consistently one of the biggest causes of headaches during development and deployment of snaps. Diagnosing confinement-related issues, and figuring out the necessary plug/slot setup (not to mention patching 3rd-party libraries), always feels like a ritual to the sun gods.
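To give a flavour of what that means in practice: once an app declares its plugs, anything that is not auto-connected has to be inspected and wired up on the device, roughly like this (the snap name below is hypothetical):

    # Show a snap's plugs/slots and which ones are (not) connected:
    snap connections ammp-datalogger

    # Manually connect an interface that is not auto-connected,
    # e.g. the system-provided hardware-observe slot:
    sudo snap connect ammp-datalogger:hardware-observe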

3. The update mechanism does not work well in bandwidth-constrained IoT contexts

Most of our devices are in remote locations with varying levels of mobile connectivity. This naturally leads to issues irrespective of update mechanism; but I feel that the snap mechanism performs particularly poorly in such an environment. In fact, we now disable the snapd service altogether, and only do updates on-demand in a more controlled manner.

Underpinning this statement are the following points:

  1. Due to the way that snaps bundle dependencies, the package sizes are generally quite large (in comparison to e.g. a deb package). This design choice is well-covered elsewhere, and I won’t dwell on it here.
  2. There are frequent updates to the coreYY and snapd snaps, which use up a lot of bandwidth. Having these snaps installed is generally required in order to run others. I haven’t done an exact analysis here, but updates to these appear to come out every couple of weeks (on the stable channels), and collectively can weigh in at 100s of MB per month. This is not insignificant. Perhaps we need more bandwidth-friendly release channels that focus on sparse security updates.
  3. Delta updates could in theory solve much of this, but in reality they are still very heavy. This is due to the way they are implemented on top of a compressed snap, rather than the underlying content; I wrote about this here.
  4. This is more subjective, but the auto-refresh mechanism often feels buggy and fragile. Over time we’ve experienced many issues that appear to stem from slow/unreliable connections. The end result is often either a refresh that has hung and never resumes (i.e. snapd just gives up on future refreshes), or a refresh that retries and fails perpetually. More specifically, I am not convinced that automatic resumption of interrupted snap downloads actually works in practice: over time there have been multiple reports on this by myself as well as others. The general solution appears to be to log in and do a manual download and install. I understand that the root cause may be hard to pin down, but either way, the mechanism doesn’t appear to meet a baseline level of robustness. The Ubuntu Core homepage describes the system as “self-healing”; I’m not sure I’d agree.
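For completeness, coming back to the point about disabling snapd: this is roughly what our “controlled” approach looks like today. It’s a sketch rather than a recommendation, since stopping snapd outright has its own trade-offs:

    # Stop snapd from refreshing anything in the background:
    sudo systemctl disable --now snapd.service snapd.socket

    # When we actually want to update, bring snapd up temporarily
    # and refresh a specific snap in a controlled window:
    sudo systemctl start snapd.socket snapd.service
    sudo snap refresh ammp-datalogger    # hypothetical snap name
    sudo systemctl stop snapd.service snapd.socket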

4. Atomic updates work well, and snaps run reliably

This is a sort of counterpoint to the last one: when refreshes do finally happen, they tend to happen reliably. This is of course a core tenet of the architecture: once downloaded, the new version of a snap is mounted/linked over an old one in its entirety in an atomic way.

I’m yet to observe a situation where this has not worked as intended. In other words, I’m not aware of any units where the application ended up in a corrupt state because of snapd.

There is an extension to this point, which is that snapd is not even required for applications to run once they’ve been installed. When a snap is installed, snapd creates systemd services that handle things like image mounts and service starts. At that point snapd is generally no longer required for the application to run. So even if snapd itself becomes corrupt, any installed snaps will generally continue to run just fine. (see also discussion here)
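This is easy to verify on any machine with snaps installed - the mounts and daemons are ordinary generated systemd units, so they come up at boot whether or not snapd itself is healthy:

    # Squashfs images of installed snaps are mounted via generated mount units:
    systemctl list-units --type=mount 'snap-*'

    # Daemons defined inside snaps run as generated services,
    # independently of the snapd daemon itself:
    systemctl list-units --type=service 'snap.*'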

5. There is a single point of failure that still frequently leads to corruption

Each unit has a /var/lib/snapd/state.json file that stores, well, the snapd state. If this file is somehow corrupted then snapd can no longer start; there is no fallback mechanism in place, and manual recovery is necessary. In fact recovery generally involves re-downloading and reinstalling all snaps from scratch, which can be a pain in a bandwidth-constrained environment.

Also, unfortunately it seems far too easy to corrupt this file - such as through power-off while snapd is writing to it after a refresh. It’s possible that we were very unlucky, but we’ve had it happen to at least a dozen of our units (I previously posted about this here and here - with some useful input from the community). I can’t help but feel that handling such critical metadata in a more robust way would be preferable. In fact, it seems out-of-step with the atomic and failure-proof way in which snap updates themselves are handled.
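As a crude stopgap - very much a DIY workaround rather than anything snapd provides, and restoring an old state file is not without risk - one could at least keep a recent copy of the file around to fall back on, e.g. from a cron job or systemd timer:

    # Periodically keep a copy of snapd's state while snapd is idle,
    # so there is at least something to restore after corruption:
    sudo cp -a /var/lib/snapd/state.json /var/lib/snapd/state.json.backup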

At least the good news - relating to my previous point - is that even if the state.json file is corrupted, any installed snaps should continue to run fine.

Right, this is about it for now. As I said, I wanted to share this as a way to start a conversation. Are my experiences in line with those of others? Do my take-aways resonate, or are they misguided?


Thanks for taking the time to post this. My focus is on Snapcraft the tool and I am glad you are enjoying it where others are not.

Is there any missing feature on the Snapcraft tool side that you would like to see? No promises on implementing anything right away, but I can at least start thinking about it :slight_smile:

I’m curious to know whether you think you’ve had fewer glitches on the GitHub/Launchpad setup versus building the snap locally? I’ve also experienced some flakiness building on a Raspberry Pi 4B (4 GB). I’m suspicious that it’s due to running out of memory, which probably wouldn’t happen on Launchpad.

Strict confinement keeps applications in a box, away from other applications and the base OS (except through defined interactions). This allows Ubuntu Core to have an immutable base OS, which is great for security and stability because it limits the scope of damage if an application is vulnerable. Strict confinement also allows updates to be applied more confidently, which is good for security too. You might be able to find more reasons in the Ubuntu Core security whitepaper.

:rofl: I feel you there. Some supported interfaces have a single sentence describing them and there are many, many interfaces to search through.

Have you tried the new lzo compression?

Could you leverage the new snapcraft refresh hold?


Thank you for sharing your experiences!

Thanks for the comments @sergiusens and @picchietti! Just to pick up on a few points.

As an anecdote, earlier today I realized that I need to install one of our snaps on an AMD64 machine…but so far we’d only built and deployed it to ARM, so there was no AMD64 version available. I pulled up snapcraft.yaml in the GitHub repo, added a - build-on: amd64 line to the architectures section, committed, and about 10 minutes later the AMD64 snap was released. That was pretty neat!
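For reference, the change was roughly of this shape (the pre-existing architectures listed here are illustrative):

    architectures:
      - build-on: armhf
      - build-on: arm64
      - build-on: amd64   # the newly added line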

Good question. More so than new features, from my perspective there is some room to improve reliability (and iron out bugs). Every once in a while I seem to run into glitches - whether inherent to snapcraft itself or to the CI build system - and tackling these can be a real time-sink. I felt that the upgrades to v7 (and core22) in particular introduced a few issues (e.g. this)…maybe it wasn’t quite ready for prime time.

We actually pretty much exclusively use GitHub+Launchpad CI for builds. It also makes it easier to collaborate with other team members, as not everyone has a build-ready RPi to hand. The main reason for occasionally doing local builds is to test major changes, since it’s easier to troubleshoot and iterate locally. But yeah, 99% of the time the CI builds work great!

That’s true. My point is that this confinement guards against a class of vulnerabilities that - while it exists in theory - is not actually common in the real world. I.e., it seems quite rare for hackers/exploits to try to remotely modify the OS or install software - perhaps because that’s harder to do (at scale) anyway, while not being that beneficial.

This is subjective of course - and I can’t pretend to have done a comprehensive study on the topic. But if we were to take a seemingly representative list like https://www.iotforall.com/5-worst-iot-hacking-vulnerabilities, or OWASP’s top 10, none of the items there are things where sandboxing would really make a difference.

To be clear, Ubuntu Core does have important security features that counter common IoT vulnerabilities: automatic updates, update verification, no default credentials. But I’m not convinced that strict confinement adds much in that regard.

Again, this is fairly subjective, and I welcome indications to the contrary. Also probably a whole separate topic in itself…

Interesting point - I haven’t. From what I understand the idea there is to speed up decompression (and app start-up times), at the expense of a larger snap size. So at first glance it doesn’t seem like something that would help here.

Yeah, I should definitely look into that further! That option wasn’t around until relatively recently…so over time we set up other mechanisms to control updates (like a read-only root) - but those have still not been ideal. So I agree that simply adopting some sort of hold pattern is likely to be a more elegant overall solution, which we should investigate!
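From a quick read of the docs, my understanding is that the hold can be applied per snap or system-wide, roughly like this - though the exact options depend on the snapd version, so treat this as a sketch:

    # Hold refreshes for a specific snap (newer snapd versions):
    sudo snap refresh --hold ammp-datalogger    # hypothetical snap name

    # Or defer all refreshes system-wide until a given time:
    sudo snap set system refresh.hold="2023-06-01T00:00:00Z"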

Thank you again for the thoughtful comments and input.


It’s very interesting to read that snap appears to be too desktop-focussed. As a community member who’s mainly working with snaps on the desktop, it always feels like the desktop use-case was an afterthought and isn’t invested in enough.

In terms of security, one of the big advantages of snap-based systems is Ubuntu Core itself. It removes the need for device vendors like yourself to roll their own distro and ensure timely updates of that distro. Although your reasons for not using Ubuntu Core are understandable, I think this means you’re missing out on some of its best features for IoT.

I don’t think “only using free (as in beer) services” should be a big issue. I know a bunch of success stories of companies using snaps and Ubuntu Core in IoT without paying for Canonical products.


Despite my initial dismissal of this, you might be on to something. I actually just experimented with this, and in a totally artificial test case (that wasn’t even a snap), the delta between two similar LZO-compressed images was ~13x smaller than the delta between the same images with XZ compression. So it’s quite possible that LZO is just much more “delta-friendly” so to speak. I added a post to the other thread:

Definitely worth investigating further!
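For anyone who wants to try a similar comparison, my test was along these lines (not my exact commands; the file names are illustrative, and as far as I know the store’s deltas are xdelta3-based):

    # Build two similar squashfs images with each compression method:
    mksquashfs ./tree-v1 v1-xz.squashfs  -comp xz
    mksquashfs ./tree-v2 v2-xz.squashfs  -comp xz
    mksquashfs ./tree-v1 v1-lzo.squashfs -comp lzo
    mksquashfs ./tree-v2 v2-lzo.squashfs -comp lzo

    # Generate binary deltas and compare their sizes:
    xdelta3 -e -s v1-xz.squashfs  v2-xz.squashfs  delta-xz.xd3
    xdelta3 -e -s v1-lzo.squashfs v2-lzo.squashfs delta-lzo.xd3
    ls -l delta-xz.xd3 delta-lzo.xd3

Enabling LZO for a real snap would, as far as I understand, then just be a matter of setting compression: lzo in snapcraft.yaml.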
