Concerns about consistency and data corruption during snap refresh

knetch · November 29, 2018, 12:05am

I think that, in its current form, the update process for snaps has several issues that can cause serious data loss. All of them stem from the fact that a snap may still be running while it is being updated.

When files are copied from one revision to another, the consistency of the data cannot be guaranteed because the copy is not atomic. The copied data may even end up unusable, and the corruption may go unnoticed until it is too late.
When the running snap writes data to its old revision's location after the update, that data is missing from the new revision. Again, the data loss may go unnoticed.
$SNAP_USER_COMMON is not a workaround. If the new revision starts before the old one stops, two different versions access the same data at the same time. I think that is asking for trouble.
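
To illustrate the atomicity point, here is a minimal Python sketch (not what snapd actually does) of a copy whose destination is either absent or complete. The function name `copy_revision_data` and the `.partial-` staging prefix are made up for illustration. The copy is staged next to the destination and then renamed into place, which is atomic on POSIX when both paths are on the same filesystem. It only protects against a half-written destination; it cannot make the copy consistent if the running snap keeps writing to the source.

```python
import os
import shutil
import tempfile


def copy_revision_data(src: str, dst: str) -> None:
    """Copy directory src to dst so that dst is either absent or complete.

    Hypothetical helper for illustration; not part of snapd.
    """
    if os.path.exists(dst):
        raise FileExistsError(dst)
    parent = os.path.dirname(os.path.abspath(dst))
    # Stage next to the destination so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix=".partial-", dir=parent)
    try:
        staged = os.path.join(staging, "data")
        shutil.copytree(src, staged, symlinks=True)
        # The rename publishes the finished copy in a single step.
        os.rename(staged, dst)
    finally:
        # Remove the staging directory, including any partial copy left by a failure.
        shutil.rmtree(staging, ignore_errors=True)
```

If the process dies mid-copy, only a `.partial-*` directory is left behind, which can be recognised and discarded without needing any state outside the home directory.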

I think snaps should never be refreshed while they are running. An update while a snap is running would not change anything anyway, because the old version is still running.

chipaca · November 29, 2018, 8:36am

We’re aware of these issues. We’ll be working on a feature we call “prevent refreshes while running” (catchy name) which should solve these.

As part of the same work we’ll be adding a mechanism for a snapped app to alert the user (“hey a new revision is ready to be installed yadda yadda”).

Not there yet though.

knetch · November 29, 2018, 1:11pm

That sounds fantastic! Is the status of this tracked somewhere?

chipaca · November 29, 2018, 1:48pm

ijohnson · November 29, 2018, 2:02pm

Is this the same problem as https://bugs.launchpad.net/snapd/+bug/1616650 ?

chipaca · November 29, 2018, 2:14pm

Ah, drat, that trello is private. I should’ve linked to

and @ijohnson yes that topic is linked to from the bug.

knetch · November 29, 2018, 4:47pm

Okay, thank you!

Another (possible) hiccup crossed my mind. I do regular crash-consistent backups of my home directory using btrfs snapshots. Let’s say the last snapshot happened while snappy was copying data during an update.

I need to restore my system, so I install a fresh Ubuntu and restore my home directory. When I now install the latest version of the snap that was being updated during the backup, does snappy correctly detect that there was an update in progress and discard any partially copied data?

chipaca · November 29, 2018, 5:02pm

If the snapshot includes the snapd state (/var/lib/snapd), then yes, it should. If it does not, then no.

knetch · November 29, 2018, 5:17pm

Okay, that is bad. I was hoping that a user’s home directory is treated as a self-contained entity. I think backups of only the home directory are pretty common, so that issue should at least be documented or, better, supported. A user may want to back up only his own data, and the home directory is a pretty good boundary.

Once “prevent refreshes while running” is implemented, will there be an option to completely disable copying of user data to avoid such issues? For example, when setting system refresh.retain=1, a move operation could be used instead of a copy. I know that prevents reverting, but that is what I have backups for.

chipaca · November 29, 2018, 7:37pm

@knetch to make your scenario a little bit more interesting to you, let me point out that if you make a backup of your home when a snap is at revision 10, and then when you restore from that backup the snap is at revision 12, and you then install the snap, it won’t see the per-revision data (and it might not be able to read the non-revisioned data).

knetch · November 29, 2018, 10:22pm

Hm, that’s true. But I think there are ways this could be handled more clearly and in a less error-prone way. First, the simple case, when system refresh.retain=1:

Version x of a snap is installed, the data is on version y. User tries to start the snap.

When y=x or y=x-1, just start the snap (moving the data forward to revision x when y=x-1).
Otherwise, notify the user that the data format is incompatible and that version y of the snap has to be installed to continue.
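
The rule above can be sketched in a few lines of Python. This is only an illustration of the proposal, not existing snapd behaviour; the names `Action` and `decide_start` are made up, and revisions are treated as consecutive integers, as in the description above.

```python
from enum import Enum


class Action(Enum):
    START = "start the snap"
    MIGRATE_THEN_START = "move the data forward one revision, then start"
    REFUSE = "tell the user to install the revision matching the data"


def decide_start(installed: int, data: int) -> Action:
    """Apply the proposed rule for refresh.retain=1 (hypothetical helper).

    installed is x, the revision of the installed snap;
    data is y, the revision the user's data belongs to.
    """
    if data == installed:
        return Action.START
    if data == installed - 1:
        return Action.MIGRATE_THEN_START
    # Data is too old or newer than the snap: never guess, ask the user.
    return Action.REFUSE
```

With the example from earlier in the thread, `decide_start(12, 10)` gives `Action.REFUSE`: data from revision 10 restored next to a snap at revision 12 is reported to the user rather than silently ignored.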

Second case, system refresh.retain=2+:

Compared to system refresh.retain=1, this additionally protects against updates gone wrong.

Only the latest version of the app data should be writable. This is important, so that we never get two diverging versions of the same data. I think that would make it too easy to shoot yourself in the foot.
When the snap is started, apply the same logic as in case 1, with y being the latest data version of the snap.
When the user notices that the update has gone wrong and wants to revert to a previous version of the snap, snappy tells the user that it needs to delete the newer version of the data (again, to avoid two diverging versions). It could perhaps offer an option to make a backup or to forcibly move the new data to the old version.
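
A rough Python sketch of the bookkeeping for this second case, again only as an illustration of the proposal and with made-up names (`writable_revision`, `revert_plan`): the newest data revision is the only writable one, and a revert lists the newer data revisions the user has to be asked about.

```python
from typing import Iterable, List


def writable_revision(data_revisions: Iterable[int]) -> int:
    """Only the newest data revision may be written to (hypothetical rule)."""
    return max(data_revisions)


def revert_plan(data_revisions: Iterable[int], target: int) -> List[int]:
    """Return the data revisions newer than target.

    These are the ones the user must agree to delete, back up, or move
    before reverting, so that two diverging versions never coexist.
    """
    revisions = set(data_revisions)
    if target not in revisions:
        raise ValueError(f"no data stored for revision {target}")
    return sorted(r for r in revisions if r > target)
```

For example, with data for revisions 10, 11 and 12, `writable_revision([10, 11, 12])` is 12, and `revert_plan([10, 11, 12], 11)` returns `[12]`, the one revision the user would be asked about.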

Additionally: support atomic backups of the user’s snap directory, so he is able to back up his data while the system is running.

Some other use cases where it is important to treat the home directory as a self-contained entity:

A user wants to encrypt his home directory. When a snap gets updated while the user is logged out, snappy can’t access the user’s snap data and can’t do the update. It needs to notify the user when the snap got updated more than once while he wasn’t logged in.
At my university the home directories are on a network file system and can be accessed from different computers. Again, snappy needs to handle cases where system and user data diverge. When in doubt, it is always better to ask the user.

I’m not 100% certain how snappy currently handles all these scenarios, because I couldn’t find good documentation on exactly how it does. Sorry for the long text, but I think it is extremely important to handle these things in a way that makes data loss, and confusion about what happens to your data, as unlikely as possible.

Edit:
I did some experiments with the revert and refresh mechanism, and it looks like, on a revert, snappy first keeps the data of the new version. But when the snap is refreshed again, that data gets overwritten with the then-current data from the old revision. That seems reasonable to me. The only thing that’s really missing then is to treat the home directories independently to support the scenarios I described (backups, encrypted home, network filesystem; maybe I am missing something?).