Maybe a pie-in-the-sky suggestion, but figured I’d share here and gauge interest.
Some time ago I posted about a non-functioning device, where the root cause was a corrupt /var/lib/snapd/state.json file: Snapd: "cannot run daemon: cannot read state: unexpected EOF". Since then we’ve had this issue come up with some frequency, in a number of other devices running in the field.
I suspect this is most likely SD card corruption resulting from a power outage during a write, though the root cause is somewhat irrelevant. What I want to put forward is the following:
- the integrity of the state.json file seems to be a uniquely fragile failure point, which can single-handedly render all snap processes non-functional
- there are probably better ways to store critical state like this
To the second point, as a reference all these devices also have SQLite databases that store application state, which also get written to quite frequently - and we are yet to see any corruption there. Given the behavior of SQLite under unexpected conditions like this, the above observation is probably not coincidental (https://www.sqlite.org/howtocorrupt.html).
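To illustrate what I mean by the second point, here is a minimal sketch of a key/value state store on top of SQLite, using Python's standard sqlite3 module. This is not how snapd works today and not a proposed design; the table name `state` and the helpers `open_state_db`, `set_state` and `get_state` are made up for illustration. The idea is only that each update is a transaction, so an interrupted write should leave the previous committed value in place rather than a half-written file (assuming the storage honours sync requests).

```python
import json
import sqlite3


def open_state_db(path):
    """Open (or create) a small key/value state store backed by SQLite."""
    conn = sqlite3.connect(path)
    # Ask SQLite to sync to disk at every critical point of a commit.
    conn.execute("PRAGMA synchronous = FULL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state ("
        "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def set_state(conn, key, value):
    """Store a JSON-serialisable value under key in a single transaction."""
    # Serialise first so a bad value never opens a transaction.
    encoded = json.dumps(value)
    # The connection context manager commits on success and rolls back
    # if an exception is raised inside the block.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, encoded),
        )


def get_state(conn, key, default=None):
    """Return the decoded value stored under key, or default if absent."""
    row = conn.execute(
        "SELECT value FROM state WHERE key = ?", (key,)
    ).fetchone()
    return default if row is None else json.loads(row[0])
```

Usage would be along the lines of `conn = open_state_db("/path/to/state.db")` followed by `set_state(conn, "snaps", {...})`. Individual subtrees could then be read and written without rewriting one large blob.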
To finish, is there any merit in trying to push forward any changes to the way that state.json is stored (e.g. by opening a Launchpad ticket)? Or are changes like this unlikely for the foreseeable future?
I do recall some discussions from a few years back, but I think those were mostly about performance. It’s a JSON blob, so marshalling/unmarshalling could be a problem; since it is possible to unmarshal only parts of the tree, it’s less costly, but still observable. Corruption could be a problem, but we try to make sure that the application side of things does everything right. This is not to say that there is no problem at all, but addressing it in a way that works everywhere is not trivial. I do recall some reports of filesystem corruption that affected the state, but those were very few, and I don’t recall Ubuntu Core being affected.
Specific to Core, there were other problems, e.g. FAT corruption on boot partitions on RPi, but IIRC this was a single device where it happened multiple times, after which they replaced the SD card and there were no further reports.
If you have some samples of the corrupt state, then filing an LP bug is probably the best way to go. It’d be great to include some logs from the system to have an idea of what was going on at the time the power was cut.
Thanks @mborzecki for the context. I indeed suspected that this would be an edge case/not a priority, but just wanted to put it out there.
The corrupt state.json files I’ve seen generally look like valid JSON that’s been truncated, with binary garbage appended at the end (sometimes a bit, sometimes a lot of it). I’m not an expert, but I strongly suspect it’s the result of SD card corruption; I’m conscious that the devices in question may frequently experience power failures, and may sometimes have low-quality SD cards installed.
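For anyone wanting to spot this kind of damage on their own devices, below is a rough sketch of a check in Python. It only tests whether a file parses as a single JSON document; `check_state_file` is a hypothetical helper of mine, not a snapd tool, and it says nothing about whether the content is a semantically valid snapd state.

```python
import json


def check_state_file(path):
    """Return (ok, detail) for a JSON state file.

    ok is True only when the whole file decodes as UTF-8 and parses
    as exactly one JSON document.
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError as err:
        # Binary garbage usually fails here, before JSON parsing starts.
        return False, f"non-UTF-8 bytes at offset {err.start}"
    except json.JSONDecodeError as err:
        # Truncated or empty files, and trailing junk, end up here.
        return False, f"JSON error at character {err.pos}: {err.msg}"
    return True, "ok"
```

Run against a copy of the file (e.g. `check_state_file("state.json")`), the reported offset gives a rough idea of where the valid JSON ends and the garbage begins.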
If it’s not an issue that’s observed more widely, then the current approach is probably fine for the general user base.
FWIW - and this may make many people cringe - what we’ve done as a workaround is set up a read-only root partition, with snapd disabled by default. The partition is then remounted R/W and snapd started on occasion.
Apparently this issue also affects regular users. Can the state be stored in something more resilient than a JSON file?