New `configure-snapd` task and reverting

mvo · November 10, 2017, 3:19pm

In master we added a new configure-snapd task, which replaces the old “configure” hook for configuring the “core” snap. This creates the following problem on core devices: when going from a snapd A that supports the new “configure-snapd” task to a snapd B that does not, the refresh/revert change ends up with a task in “Do” state that will never run. B does not know this task kind, so the change never finishes.

This is easy to trigger. On an Ubuntu Core 16 device run:

$ sudo snap refresh --edge core
[reboot]

$ sudo snap refresh --candidate core
[reboot]

$ snap changes
...
5    Do      2017-11-10T15:25:43Z  -                     Refresh "core" snap from "candidate" channel

$ snap change 5
...
Do      2017-11-10T15:25:43Z  -                     Run configure hook of "core" snap if present

$ sudo snap refresh core --edge
error: cannot refresh "core": snap "core" has changes in progress

So there is no refresh/auto-refresh out of this situation.

The good news is that it’s easy to resolve manually:

$ sudo snap abort 5

But we need something that does not require manual intervention here, and we need to fix edge so that it is robust against this.

niemeyer · November 10, 2017, 5:01pm

After some good discussion on IRC with @mvo and @pedronis, there are two independent things we want to do. One addresses this specific case, the other addresses similar cases in the future (right now non-existent).

The case of configure-snapd task

In the specific case of configure-snapd, the ideal scenario is that we never create the task to begin with, because there’s very little benefit in running a configure task from snapd that will configure snapd when snapd updates snapd! Does that sound repetitive? Well, exactly. When such an update happens, snapd is already in control, and if it needs to perform any changes it can always just do them, and in the specific places that need the change. This is much safer than the danger of doing such re-entrant logic in sensitive scenarios. This is also quite simple to do right.

Note that this change should not affect the behavior of the snap set command. This will still create the task.
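As a rough illustration of the rule described above, here is a minimal Go sketch. It is not snapd code: the `trigger` type, its constants, and `needsConfigureTask` are hypothetical names made up for this example. It models only the two cases mentioned here, namely that a refresh or revert of "core" skips the configure task while an explicit snap set still creates it.

```go
package main

import "fmt"

// trigger is a hypothetical label for what caused a change to be created.
type trigger string

const (
	triggerRefresh trigger = "refresh"
	triggerRevert  trigger = "revert"
	triggerSet     trigger = "set" // an explicit "snap set"
)

// needsConfigureTask reports whether a configure task should be scheduled.
// Assumption for this sketch: only a refresh/revert of "core" skips the
// task; "snap set" always schedules it, for any snap.
func needsConfigureTask(snapName string, why trigger) bool {
	if why == triggerSet {
		return true
	}
	return snapName != "core"
}

func main() {
	cases := []struct {
		name string
		why  trigger
	}{
		{"core", triggerRefresh},
		{"core", triggerRevert},
		{"core", triggerSet},
		{"some-app", triggerRefresh},
	}
	for _, c := range cases {
		fmt.Printf("%s (%s): schedule configure task = %v\n",
			c.name, c.why, needsConfigureTask(c.name, c.why))
	}
}
```

With a rule like this, a refresh or revert of core never leaves behind a task kind that an older snapd might not know.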

All similar cases in the future

The second thing we want to do is to find a way to prevent similar problems from affecting any other task kinds in the future. That’s not a real problem today, but it may be once we add more task kinds that fall onto the critical pipeline of refreshes and reverts. So at some point in the upcoming releases, we should change snapd so that unknown tasks are aborted promptly. This requires a bit of care because right now our task handlers are completely independent from one another, which means there’s no single place with global knowledge of all task handlers. So we need to carefully design a mechanism that can decide when a task ought to have been handled already, but wasn’t, and most likely (or definitely) can’t be handled by the current snapd. This will fix any future cases similar to the original problem we observed here.

pedronis · November 10, 2017, 5:12pm

We need to be a bit careful with the first boot code. There is usually no restart there, and it now schedules the configuration of core etc. based on possible gadget default values, exactly between installing the fundamental snaps (core, kernel, gadget) and the other required snaps.

pstolowski · December 15, 2017, 4:53pm

I’ve started looking into this as it’s very desirable before my changes to autoconnect land.
I’m thinking of the following high-level plan:

Introduce a helper in task runner and all managers to expose the names (kinds) of tasks they support. Overlord will use these helpers to gather all known task kinds.
Knowing all the supported kinds of tasks, overlord will process all tasks in the state and will deal with unknown tasks. This will most likely be realized by a specialized implementation of task runner which only handles “unknown” tasks and aborts them by default.
It will be possible for tasks to define a flag to override the default behavior of that special taskrunner; if the flag is present, instead of getting aborted they will be marked ‘done’. We can use it for future tasks we know are not critical, for example when we split an existing task into two (keeping existing task kind + adding a new one), an old snapd can still handle “the old task kind” from before the splitting and ignore the new task kind.
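The three steps above can be sketched roughly as follows. This is a self-contained Go illustration under stated assumptions, not the code from the PR: the `kindSource` interface, the `task` struct, the `ignoreIfUnknown` flag, and the status constants are all hypothetical stand-ins for whatever the task runner, the managers, and the state actually expose.

```go
package main

import "fmt"

// status is a hypothetical stand-in for a task status.
type status string

const (
	statusDo    status = "Do"
	statusDone  status = "Done"
	statusAbort status = "Abort"
)

// task is a simplified, hypothetical model of a task in the state.
type task struct {
	kind   string
	status status
	// ignoreIfUnknown is the hypothetical flag from the plan: when set,
	// a snapd that does not know the kind marks the task done instead
	// of aborting it.
	ignoreIfUnknown bool
}

// kindSource is a hypothetical interface for anything (task runner,
// manager) that can report which task kinds it handles.
type kindSource interface {
	KnownTaskKinds() []string
}

// staticKinds is a trivial kindSource used for the example.
type staticKinds []string

func (s staticKinds) KnownTaskKinds() []string { return s }

// gatherKinds collects the union of all kinds reported by the sources.
func gatherKinds(sources ...kindSource) map[string]bool {
	known := make(map[string]bool)
	for _, s := range sources {
		for _, k := range s.KnownTaskKinds() {
			known[k] = true
		}
	}
	return known
}

// handleUnknown aborts pending tasks of unknown kind by default, or
// marks them done when the task carries the ignoreIfUnknown flag.
func handleUnknown(tasks []*task, known map[string]bool) {
	for _, t := range tasks {
		if t.status != statusDo || known[t.kind] {
			continue
		}
		if t.ignoreIfUnknown {
			t.status = statusDone
		} else {
			t.status = statusAbort
		}
	}
}

func main() {
	known := gatherKinds(staticKinds{"link-snap"}, staticKinds{"run-hook"})
	tasks := []*task{
		{kind: "link-snap", status: statusDo},
		{kind: "configure-snapd", status: statusDo},
		{kind: "new-optional-step", status: statusDo, ignoreIfUnknown: true},
	}
	handleUnknown(tasks, known)
	for _, t := range tasks {
		fmt.Printf("%-18s %s\n", t.kind, t.status)
	}
}
```

Tasks of known kinds are left untouched so that the regular handlers can still run them; only pending tasks that no source claims are resolved.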

The PR adding new helper is here: https://github.com/snapcore/snapd/pull/4405

niemeyer · December 18, 2017, 3:54pm

Thanks for looking into this. The basic idea sounds sane, but it feels like the over-customization of Done vs. Abort is unnecessary for the time being. We have a single use case so far, and this use case is the one requiring the handling as Done instead of Abort. So it feels a bit like over-generalization to have these further knobs that need to be tuned just to get the single behavior we know we want right now.

Current PR looks good, though.

pstolowski · December 18, 2017, 4:13pm

I agree we don’t have an immediate case for customization of done vs. abort. However, we have to remember that the sooner we introduce something like this the better, because we will only be able to revert to the first revision that’s “flexible enough” to deal with unknown tasks.

pstolowski · January 4, 2018, 2:14pm

PR 4405 has landed. The followup PR that implements the new logic is here: https://github.com/snapcore/snapd/pull/4440