In late 2017 I started playing around with Ubuntu Core, and soon after ended up deploying snaps to a fleet of IoT devices. Over the past 5 years, at my start-up (AMMP), we’ve used snaps to deploy software to several hundred dataloggers monitoring renewable energy systems in remote locations. I’ve had many reflections on the experience over this period, and I thought it’s time to share some of the main ones with the community.
My one-sentence takeaway is that I don’t think snaps are set up to be a serious contender in the IoT space. There aren’t many issues that make snaps fundamentally unfit for IoT (though there are some). What primarily worries me is that the direction of development - both past and present - appears to be geared towards improving the snap experience on the desktop. Truly cracking the desktop is a major undertaking in itself. And I believe it will be fundamentally difficult for a single package architecture to cater well to both desktop and IoT needs. The trade-offs are just too stark. I elaborate on this in the points below.
First, a couple of caveats:
Caveat 1: After trying to work with Ubuntu Core at the start, we did not use it in real-world deployments. Instead we installed snaps on top of other Debian-based OSes. This was for two reasons:
- Due to functionality requirements, we use hardware from vendors like UniPi, Sfera Labs, and Moxa. Even though most of the units are Raspberry Pi-based, their peripherals are not natively supported by Ubuntu Core - and after some initial attempts, we determined that we don’t have the resources to do the relevant hardware enablement. It was more straightforward to use the vendor-supplied software images and device drivers, and install snaps on top.
- We never quite built up the confidence that Ubuntu Core was sufficiently robust to warrant going all-in on it. The way it restricts what can be done with a unit (and various stories about bricked deployments on this forum) also gave us pause.
So I can’t rule out that using Ubuntu Core would have resulted in a much better experience - though based on my understanding of how the ecosystem fits together, I feel that most of my points would still apply.
Caveat 2: So far we’ve only used free (as in beer) services, and not paid for professional development or support from Canonical, or a brand store, etc. It’s quite possible that companies that use paid services somehow get a wholly different experience. Though I’ve tried to focus my points on the underlying software and architecture itself.
Now that I’ve got this out of the way, here are my takeaways, both positive and negative:
1. The developer and distribution experience is quite good
Reading this forum, I see a lot of “mixed reviews” regarding the snap development experience - e.g. this recent thread. Despite a few hitches, our experience of packaging our software as snaps has been quite good.
Doing the initial snap packaging - and maintenance since then - has not been too onerous; in no small part as a result of some very helpful individuals (including the Canonical team) on this forum. There are the occasional glitches, and half-documented features, but these don’t usually lead to critical issues. The fact that there is a free service that hooks into GitHub and automatically packages and distributes snaps for multiple architectures has been quite pleasing!
Finally, I know that the Canonical-run closed-source snap store can divide opinion. I would say it actually works quite well for software distribution in an IoT context. I suspect that the (paid-for) brand store is also a nice upgrade for those that need more fine-grained control.
2. The sandbox mechanism is pointless, and it’s a pain
I find it unfortunate that strict confinement plays such a major role in Ubuntu Core’s marketing material. I’m happy to be corrected, but I fail to see how it meaningfully improves security in an IoT context.
Specifically, I don’t believe that sandboxing mitigates the kinds of security vulnerabilities that are prevalent in IoT. From what I’ve seen, the majority of exploits involve weak application security, which allows an attacker to use a pwned IoT device to infiltrate the local network. In other words an insecure IoT device is rarely the target - it’s simply a jumping-off point for a broader attack. Application sandboxing generally cannot help prevent this.
And in cases where the IoT device - such as a security camera - is the target, sandboxing also does nothing useful. The software on an IoT device fundamentally needs to have access to its interfaces (camera, network) and application data; so an exploit like RCE will always get the attacker access to those, regardless of sandboxing.
The sandbox paradigm is primarily relevant when running untrusted applications on a desktop. It does little in an IoT context.
Yet it is consistently one of the biggest causes of headaches during development and deployment of snaps. Diagnosing confinement-related issues, and figuring out the necessary plug/slot setup (not to mention patching 3rd-party libraries), always feels like a ritual to the sun gods.
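To make the pain concrete, here is a hedged sketch of the kind of app stanza this involves in a snapcraft.yaml - the snap and app names are made up, but the interfaces are real snapd interfaces:

```yaml
# Hypothetical app stanza; "datalogger" is an illustrative name.
apps:
  datalogger:
    command: bin/datalogger
    daemon: simple
    plugs:
      - network          # outbound connectivity; auto-connected
      - serial-port      # e.g. an RS-485 line; needs a slot provider
      - gpio             # pin access; typically gadget-provided on Ubuntu Core
      - hardware-observe # read-only access to device information
```

Interfaces like serial-port and gpio are not auto-connected, and on a classic (non-Ubuntu Core) system the slot side may not exist at all - so each device tends to need its own round of snap connect incantations and trial-and-error.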
3. The update mechanism does not work well in bandwidth-constrained IoT contexts
Most of our devices are in remote locations with varying levels of mobile connectivity. This naturally leads to issues irrespective of update mechanism; but I feel that the snap mechanism performs particularly poorly in such an environment. In fact, we now disable the snapd service altogether, and only do updates on-demand in a more controlled manner.
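In practice this amounts to something like the following (the snap name is illustrative; newer snapd versions also offer snap refresh --hold as a less drastic alternative):

```shell
# Stop snapd from auto-refreshing by taking it out of the picture:
sudo systemctl disable --now snapd.service snapd.socket

# Later, during a controlled maintenance window:
sudo systemctl enable --now snapd.socket snapd.service
sudo snap refresh my-datalogger   # hypothetical snap name
```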
Underpinning this statement are the following points:
- Due to the way that snaps bundle dependencies, the package sizes are generally quite large (in comparison to e.g. a deb package). This design choice is well-covered elsewhere and I won’t dwell on it here.
- There are frequent updates to the snapd and base snaps, which use up a lot of bandwidth. Having these snaps installed is generally required in order to run others. I haven’t done an exact analysis here, but updates to these appear to come out every couple of weeks (on the stable channels), and collectively can weigh in at 100s of MB per month. This is not insignificant. Perhaps we need more bandwidth-friendly release channels that focus on sparse security updates.
- Delta updates could in theory solve much of this, but in reality they are still very heavy. This is due to the way they are implemented on top of a compressed snap, rather than the underlying content; I wrote about this here.
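A toy illustration of the underlying problem: after compression, a one-byte change near the start of the content leaves almost nothing in common downstream, so a delta computed over the compressed artefact (rather than the raw content) is necessarily large. A rough sketch:

```python
import zlib

def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of identical trailing bytes shared by a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

base = b"datalogger reading: 42.0 kW\n" * 5000   # ~140 KB of compressible data
changed = bytearray(base)
changed[100] ^= 0xFF                             # flip one byte near the start
changed = bytes(changed)

raw_shared = common_suffix_len(base, changed)
compressed_shared = common_suffix_len(zlib.compress(base, 9),
                                      zlib.compress(changed, 9))

# Uncompressed, everything after the flipped byte is still identical;
# once compressed, the two streams share essentially no common suffix.
print(raw_shared, compressed_shared)
```

A delta tool pointed at the raw content would find ~140 KB of unchanged data to reuse; pointed at the compressed blobs, it finds almost nothing.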
- This is more subjective, but the auto-refresh mechanism often feels buggy and fragile. Over time we’ve experienced many issues that appear to stem from slow/unreliable connections. The end result is often either a refresh that hangs and never resumes (i.e. snapd just gives up on future refreshes), or a refresh that retries and fails perpetually. More specifically, I am not convinced that automatic resumption of interrupted snap downloads actually works in practice: over time there have been multiple reports on this by myself as well as others. The general solution appears to be to log in and do a manual download and install. I understand that the root cause may be hard to pin down, but either way, the mechanism doesn’t appear to meet a baseline level of robustness. The Ubuntu Core homepage describes the system as “self-healing”; I’m not sure I’d agree.
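For reference, the manual escape hatch when a refresh does wedge usually looks something like this (change id and snap name are illustrative):

```shell
# See in-flight or stuck changes on the unit:
snap changes

# Follow a specific change, or abort one that is wedged:
snap watch 123        # change id taken from `snap changes`
sudo snap abort 123

# Then retry the refresh by hand:
sudo snap refresh my-datalogger   # hypothetical snap name
```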
4. Atomic updates work well, and snaps run reliably
This is a sort of counterpoint to the last one: when refreshes do finally happen, they tend to happen reliably. This is of course a core tenet of the architecture: once downloaded, the new version of a snap is mounted/linked over an old one in its entirety in an atomic way.
I’ve yet to observe a situation where this has not worked as intended. In other words, I’m not aware of any units where the application ended up in a corrupt state because of a partially-applied update.
There is an extension to this point, which is that snapd is not even required for applications to run once they’ve been installed. When a snap is installed, snapd sets up systemd units that handle things like image mounts and service starts. At that point snapd is generally no longer required for the application to run. So even if snapd itself becomes corrupt, any installed snaps will generally continue to run just fine. (See also the discussion here.)
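You can see this directly on a unit: the squashfs image is mounted by a generated systemd mount unit, and each daemon gets a generated service unit, with snapd itself out of the runtime path. A quick sketch (the snap/app name is hypothetical):

```shell
# Generated units: snap-<name>-<revision>.mount plus snap.<name>.<app>.service
systemctl list-units 'snap*'

# Even with snapd stopped, the application's units keep running:
sudo systemctl stop snapd.service snapd.socket
systemctl is-active snap.my-datalogger.my-datalogger.service
```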
5. There is a single point of failure that still frequently leads to corruption
Each unit has a /var/lib/snapd/state.json file that stores, well, the snapd state. If this file is somehow corrupted then snapd can no longer start; there is no fallback mechanism in place, and manual recovery is necessary. In fact, recovery generally involves re-downloading and reinstalling all snaps from scratch, which can be a pain in a bandwidth-constrained environment.
Also, unfortunately it seems far too easy to corrupt this file - such as through power-off while snapd is writing to it after a refresh. It’s possible that we were very unlucky, but we’ve had it happen to at least a dozen of our units (I previously posted about this here and here - with some useful input from the community). I can’t help but feel that handling such critical metadata in a more robust way would be preferable. In fact, it seems out-of-step with the atomic and failure-proof way in which snap updates themselves are handled.
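For comparison, the boring-but-robust pattern for this kind of file - write to a temp file, fsync, then atomically rename over the original - is cheap to implement. A sketch of the pattern (not snapd’s actual code; names are illustrative):

```python
import json
import os
import tempfile

def atomic_write_json(path: str, obj) -> None:
    """Write obj as JSON to path so readers never see a partial file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())      # data hits the disk before the rename
        os.replace(tmp, path)         # atomic on POSIX: old file or new, never half
    except BaseException:
        os.unlink(tmp)
        raise

state_path = os.path.join(tempfile.mkdtemp(), "state.json")
atomic_write_json(state_path, {"snaps": {"my-datalogger": {"revision": 1}}})
atomic_write_json(state_path, {"snaps": {"my-datalogger": {"revision": 2}}})
with open(state_path) as f:
    print(json.load(f)["snaps"]["my-datalogger"]["revision"])  # → 2
```

With this scheme a power-off mid-write leaves behind a stray temp file, but the previous state.json survives intact.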
At least the good news - relating to my previous point - is that even if the state.json file is corrupted, any installed snaps should continue to run fine.
Right, this is about it for now. As I said, I wanted to share this as a way to start a conversation. Are my experiences in line with those of others? Do my takeaways resonate, or are they misguided?