How to cause a snap rollback?

Saviq · February 6, 2019, 7:04pm

Hey all,

We want to transition Multipass to core18 and realized there’s a case where the upgrade needs user intervention to avoid data loss. What’s the state-of-the-art approach to this situation?

Forcing a rollback (how)?
Failing one of the hooks (which)?
Epochs are not a thing yet, are they? Would they be of help?

chipaca · February 6, 2019, 7:29pm

The basic epoch functionality is there, but the stepped upgrade isn’t, nor is the documentation. Not sure how that would help, if there’s a manual step. Sounds like the pre-refresh hook would be your best bet.

I’m curious what the needed intervention is though.

Saviq · February 6, 2019, 8:48pm

On Multipass refresh (or stop, to be exact), any running instances get suspended with a snapshot. Then, on starting again, those that were running before get resumed. That snapshot is not compatible between the 16.04 and 18.04 versions of qemu, meaning that on that transition those instances can’t be resumed.

The only safe option is to let the user know they need to save any unsaved data they have in the instances and shut them down. There’s no conversion that we can do, short of keeping both versions of qemu in the snap and flipping instances to 18.04 on shutdown…

Can we fail the pre-refresh hook? Can we even know what version of the snap will be next?

kyrofa · February 6, 2019, 8:54pm

I believe the pre-refresh hook is run in the previous snap, not the new one. In that sense, no, I don’t think you can know which one is coming, and I don’t think it matters without epochs: if you release a snap with such a pre-refresh and then release the new one, it’s entirely possible for folks to skip the one with the pre-refresh and go straight to the new one. It seems the only solution here is to actually use a post-refresh, but I expect a failure in that hook will actually mark the new snap revision as bad for that system, and the upgrade won’t be attempted again automatically until you release a new rev (although I think they can manually refresh it).

Saviq · February 6, 2019, 9:29pm

Right, so failing pre-refresh would only make sense with stepped upgrades. And even then, without knowing which way it’s going (the user might’ve requested to refresh to a lower revision) it might be wrong to fail that pre-refresh. It could be an annoyance in this particular case, but would keep the user’s data safe.

But without stepped upgrades…
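For illustration, the "fail the pre-refresh hook" idea could be sketched roughly like this in Python. This is only a sketch under assumptions: the hook would be an executable shipped as snap/hooks/pre-refresh in the old snap, a non-zero exit is assumed to make the refresh fail, and `running_instances()` is a hypothetical placeholder (how to actually ask the Multipass daemon is not shown in this thread).

```python
#!/usr/bin/env python3
"""Sketch of a pre-refresh hook that refuses to refresh while instances run."""
import sys


def running_instances():
    """Hypothetical placeholder: return the names of instances still running.

    A real hook would have to query the Multipass daemon; that query is
    deliberately left out because it is not described in this thread.
    """
    return []


def hook_exit_code(instances):
    """Return 1 (fail the hook) if any instance is running, else 0."""
    if instances:
        names = ", ".join(sorted(instances))
        print(
            "Refusing to refresh: save your work and shut down these "
            "instances first: " + names,
            file=sys.stderr,
        )
        return 1
    return 0


# A real hook script would finish with:
#     sys.exit(hook_exit_code(running_instances()))
```

As noted above, a check like this cannot tell which revision is coming next, so it would also block refreshes that are perfectly safe.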

kyrofa · February 6, 2019, 9:31pm

Maybe maintain a separate track for now and try to socialize it? Then keep an eye on your metrics and when near-zero folks are on stable, update it?

Saviq · February 6, 2019, 9:49pm

Yeah we don’t have stable yet (good riddance!) and I was thinking that we could keep beta on core16 and get people off it gradually.

chipaca · February 7, 2019, 9:57am

Keep beta on core (for now), and release the core18 to candidate or stable, with epoch: 1.

In my mind there has to be a way to convert the snapshots. Maybe ship qemu from both 16.04 and 18.04 in the same snap? One of them could be statically linked. In any case, once you know exactly how you want to transition your users from the old world to the new, you can release a transitional epoch: 1* snap to beta that implements that. Once most of your users in beta are on that, you can release epoch: 1 snaps to beta.

What isn’t released is the stepping of an upgrade through epochs in a single refresh. Stepping slowly already works. If a user is on beta (with epoch: 0) and tries to do snap refresh --candidate to something that has epoch: 1 and there is an epoch: 1* somewhere in the channel history to bounce through, the refresh will end up on that revision (with a warning about the candidate being closed), and then a later auto-refresh (or a manual snap refresh) will get them to the wanted revision.
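To make the bounce concrete, here is a small Python model of the behaviour described above. It is a sketch, not snapd's code: it assumes that epoch: N reads and writes only N, that epoch: N* reads N-1 and N and writes N, and that a refresh is allowed when the new revision can read what the installed one writes. The names `parse_epoch`, `can_read`, and `plan_refresh` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epoch:
    read: frozenset   # data epochs this revision can read
    write: frozenset  # data epochs this revision writes


def parse_epoch(text):
    """Parse only the short forms "N" and "N*"."""
    if text.endswith("*"):
        n = int(text[:-1])
        if n == 0:
            # "0*" has no predecessor to read, so this sketch rejects it.
            raise ValueError("0* is not a meaningful epoch")
        return Epoch(frozenset({n - 1, n}), frozenset({n}))
    n = int(text)
    return Epoch(frozenset({n}), frozenset({n}))


def can_read(new, old):
    """True if `new` can read at least one epoch that `old` writes."""
    return bool(new.read & old.write)


def plan_refresh(current, target, history):
    """Return the epochs a refresh would land on, in order.

    Tries a direct refresh first, then a single bounce through one
    revision from the channel history. An empty list means no path.
    """
    cur, tgt = parse_epoch(current), parse_epoch(target)
    if can_read(tgt, cur):
        return [target]
    for candidate in history:
        mid = parse_epoch(candidate)
        if can_read(mid, cur) and can_read(tgt, mid):
            return [candidate, target]
    return []
```

With a user on epoch: 0 asking for epoch: 1, `plan_refresh("0", "1", ["0", "1*"])` goes through the transitional 1* revision first, which matches the two-step refresh described above; with no 1* in the history, the model finds no path.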

Saviq · February 7, 2019, 4:48pm

Yeah, if conversion were possible, we wouldn’t even need epochs. But we haven’t found a way yet.

The keep-both-qemus scenario came to mind, but seems too much effort for little gain.

The plan of record is to detect the problem and let the user know how to solve it. See our GitHub issue for more details.