A while ago we were tracking a bug where refreshing to a new core broke LXD. The issue was traced to the fact that after such refresh, the LXD snap would appear to have no plugs or slots defined. After extensive analysis we found that the following has happened:
snapd would refresh both the core snap and the lxd snap.
lxd is stopped and deactivated for the refresh
the refresh change got to the point where core was waiting to do 2nd phase interface setup
snapd restarts and picks up where it left off, activating lxd and saying “all good”
but, the damage is done now, when snapd.service restarts some ephemeral state was lost
When snapd starts up it has to add interfaces to the ephemeral interface repository, a collection of plugs and slots and associated snaps. The repository tracks the state as it is now, so it doesn’t keep track of past, inactive revisions. When the daemon starts up and constructs the overlord, the overlord constructs the interface manager and then the interface manager populates the interface repository. The condition used to check if a snap needs to be added to the repository is (or rather was back then when the bug existed) was that the snap was marked as active in the snap state.
If snapd had not re-started for core update, then the LXD snap would still be in the interface repository and would correctly pass through the activation and re-connection phase. Since after snapd restarted the snap was still partially through the refresh process, it would not be added to the repository. The following refresh tasks would activate it, which would setup security, but without the interfaces security the application would not operate correctly.
The proposed fix was to look through the ongoing tasks and cherry pick information about snaps being inactive but currently refreshed and also add them to the interface repository on snapd startup.
This works fine. The problem is that this highlights the fact that being active in the snap state is not the same as being in the interface repository. The workaround that was merged works in the “do” direction but doesn’t in the “undo”, in case something goes wrong. We said that at the time the logic was already a bit convoluted and explicit new state is a clean way out of the problem. This is what was proposed as a stepping stone towards a more direct and simple solution.
Having written this I’m not sure of the following things:
Is the case where lxd and core refresh together racy? Can LXD successfully refresh before core refresh reaches the restart moment? My memory is rusty now but I suspect the answer is yes, the bug was easy to reproduce but not everyone was affected
How is the problem affected by the removal of 2nd phase security setup?
There are two places that adds plugs and slots to the in-memory repository, notice that in both cases we add/need to add the current or to be current revision of the snap:
the setup-profiles ifacestate task
code at startup that goes over (ignoring current workarounds) active snaps and adds plugs and slots from their current revision
We should remember that:
during install the snap is inactive until the link-snap task, also until that point the snap has no entry in the "snaps" mapping from snap name to SnapState
during refresh is the same, the snap gets inactive until the link-snap task, and on undo would also stay inactive until the relinking of what was the current revision
link-snap is a snapstate task, setup-profiles is an ifacestate task.
setup-profiles happens just before link-snap.
In the Do case we have a window between the two where if snapd restarts, the plugs and slots of the snap should have been added to the in-memory repo and are assumed so, but that is not covered by the startup code that considers active snaps only.
In the Undo case we have even a larger window of tasks since undo of setup-profiles puts back the previous revision slots and plugs, until the undo of unlink-current-snap makes the snap active again at that previous version.
Crashes/restarts in those windows result coming back in a snapd with an inconsistent repo state for those snaps being operated on.
Here we have a mismatch in ifacestate trying to reuse snapstate state which gets updated according to snapstate rules but because of the task/responsibility split ends up being insufficient, not representing the effects of ifacestate.
After discussion current idea to explore is to, coalesce setting a snap to active with setting its security profiles, concretely:
coalesce the Dosetup-profiles logic into link-snapDo logic
coalesce the Undosetup-profiles logic into Undounlink-current-snap logic (and theoretically Undounlink-snap logic but that doesn’t exist currently)
For completeness other cases of *link*snap logic would need to have remove-profiles logic coelesced into them, but it might not be strictly necessary. It means unused profiles could stick around while the snap is inactive.
One issue here is that the atomicity of setup-profiles, which does release the state lock and might touch more than one snap profiles, and also of other tasks/ops of ifacestate manager, is guaranteed through a simple blocked predicate set on the ifacestate manager TaskRunner. This breaks down if part of those tasks are executed instead under tasks of the snapstate manager and its runner, and needs a solution.
a relatively simple solution but needing some work is to migrate to have just one shared TaskRunner, needs some double checking and care, but I think the mostly affected area will be tests that will need a bit of changes, some test patterns will need to change.