A while ago we were tracking a bug where refreshing to a new core broke LXD. The issue was traced to the fact that after such a refresh, the LXD snap would appear to have no plugs or slots defined. After extensive analysis we found that the following had happened:
- snapd would refresh both the core snap and the lxd snap
- lxd is stopped and deactivated for the refresh
- the refresh change got to the point where core was waiting to do 2nd phase interface setup
- snapd restarts and picks up where it left off, activating lxd and saying “all good”
- but the damage is already done: when snapd.service restarted, some ephemeral state was lost
When snapd starts up it has to add interfaces to the ephemeral interface repository, a collection of plugs and slots and their associated snaps. The repository tracks the state as it is now, so it doesn’t keep track of past, inactive revisions. When the daemon starts up and constructs the overlord, the overlord constructs the interface manager, and the interface manager then populates the interface repository. The condition used to check whether a snap needs to be added to the repository is (or rather was, back when the bug existed) that the snap is marked as active in the snap state.
If snapd had not restarted for the core update, the LXD snap would still be in the interface repository and would correctly pass through the activation and re-connection phase. Because the snap was still partway through the refresh process when snapd restarted, it was not marked as active and so was not added to the repository. The remaining refresh tasks would then activate it and set up security, but without the security provided by the interfaces the application would not operate correctly.
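To make the failure concrete, here is a minimal sketch of the startup condition described above. It is not snapd's actual code: `SnapState`, `Repository` and `populateActiveOnly` are hypothetical names invented for illustration, and the real interface repository tracks plugs and slots rather than just snap names.

```go
package main

import "fmt"

// SnapState is a hypothetical, simplified stand-in for the per-snap
// entry kept in the snap state.
type SnapState struct {
	Name   string
	Active bool
}

// Repository is a hypothetical in-memory interface repository. Here it
// only remembers which snaps were added to it.
type Repository struct {
	snaps map[string]bool
}

func NewRepository() *Repository {
	return &Repository{snaps: make(map[string]bool)}
}

func (r *Repository) AddSnap(name string)      { r.snaps[name] = true }
func (r *Repository) HasSnap(name string) bool { return r.snaps[name] }

// populateActiveOnly rebuilds the repository the way the text describes
// the old behaviour: only snaps marked active are added.
func populateActiveOnly(states []SnapState) *Repository {
	repo := NewRepository()
	for _, st := range states {
		if !st.Active {
			continue // inactive snaps are skipped, even mid-refresh
		}
		repo.AddSnap(st.Name)
	}
	return repo
}

func main() {
	// State as it might look when snapd restarts in the middle of the
	// refresh: lxd has been deactivated but not yet re-activated.
	states := []SnapState{
		{Name: "core", Active: true},
		{Name: "lxd", Active: false},
	}
	repo := populateActiveOnly(states)
	fmt.Println("core in repository:", repo.HasSnap("core"))
	fmt.Println("lxd in repository:", repo.HasSnap("lxd"))
}
```

In this sketch lxd never reaches the repository, so any later task that activates it finds no plugs or slots to work with.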
The proposed fix was to look through the ongoing tasks, cherry-pick information about snaps that are inactive but currently being refreshed, and add those snaps to the interface repository on snapd startup as well.
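The idea behind the fix can be sketched roughly as follows. This is an illustration under assumptions, not the merged patch: `snapEntry`, `pendingTask`, the `"refresh"` kind string and `snapsForRepository` are all hypothetical, and real snapd tasks carry far more information than shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// snapEntry is a hypothetical, simplified view of a snap in the snap state.
type snapEntry struct {
	Name   string
	Active bool
}

// pendingTask is a hypothetical, simplified view of a task in a change.
type pendingTask struct {
	Snap string
	Kind string // assumed label, e.g. "refresh"
	Done bool   // true once the task has finished
}

// snapsForRepository returns the sorted names of snaps that should be
// added to the interface repository on startup: every active snap, plus
// every snap that has an unfinished refresh task.
func snapsForRepository(snaps []snapEntry, tasks []pendingTask) []string {
	wanted := make(map[string]bool)
	for _, s := range snaps {
		if s.Active {
			wanted[s.Name] = true
		}
	}
	for _, t := range tasks {
		if t.Kind == "refresh" && !t.Done {
			wanted[t.Snap] = true // inactive, but mid-refresh
		}
	}
	names := make([]string, 0, len(wanted))
	for name := range wanted {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	snaps := []snapEntry{
		{Name: "core", Active: true},
		{Name: "lxd", Active: false},
	}
	tasks := []pendingTask{
		{Snap: "lxd", Kind: "refresh", Done: false},
	}
	fmt.Println(snapsForRepository(snaps, tasks)) // prints [core lxd]
}
```

Note that the sketch only consults tasks moving forward; it says nothing about what should happen to the repository if those tasks are later undone.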
This works fine. The problem is that it highlights the fact that being active in the snap state is not the same as being in the interface repository. The workaround that was merged works in the “do” direction but not in the “undo” direction, in case something goes wrong. At the time we said that the logic was already a bit convoluted and that explicit new state would be a clean way out of the problem. This is what was proposed as a stepping stone towards a more direct and simple solution.
Having written this I’m not sure of the following things:
- Is the case where lxd and core refresh together racy? Can LXD successfully refresh before the core refresh reaches the restart moment? My memory is rusty now, but I suspect the answer is yes: the bug was easy to reproduce, but not everyone was affected.
- How is the problem affected by the removal of 2nd phase security setup?