Operations on snaps are split over many tasks; the tasks for a single snap and operation are chained, but overall tasks run concurrently. At the same time we need to guarantee consistent state for snaps, both in memory and on disk.
A previous discussion and sharpening of tools relevant to that was here: Transactionality, locking and other concurrency coordination
So far the main mechanism to guarantee state consistency when splitting operations over many tasks has been to run conflict checks before creating the tasks and changes in the first place:
- at most one in-progress change can touch a given snap at any time (as we have grown task kinds, and tasks can be contributed by many state managers, these checks have become needlessly complicated; the listing of relevant task kinds is fragile and may be wrong)
Manipulating interface security profiles and connections has always been problematic from this point of view, because by its very nature it touches more than one snap state at a time. The current/previous solution to that problem was to make sure at most one task operating on those is running at any time (ifacestate uses SetBlocked for this).
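As an illustration, here is a minimal sketch of serializing a set of task kinds via a task runner's SetBlocked, in the spirit of what ifacestate does; the set of kinds listed is illustrative, not the real one:

```go
package main

import (
	"github.com/snapcore/snapd/overlord/state"
)

// serializedKinds is an illustrative set of interface-related task
// kinds that should not run concurrently with each other.
var serializedKinds = map[string]bool{
	"connect":        true,
	"disconnect":     true,
	"setup-profiles": true,
}

// setupSerialization installs a blocked predicate on the runner so
// that at most one task from serializedKinds runs at any time.
func setupSerialization(runner *state.TaskRunner) {
	runner.SetBlocked(func(t *state.Task, running []*state.Task) bool {
		if !serializedKinds[t.Kind()] {
			return false
		}
		// block t if any other serialized task is already running
		for _, other := range running {
			if serializedKinds[other.Kind()] {
				return true
			}
		}
		return false
	})
}
```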
Challenges from recent features
A series of recent features and enhancements has stressed these old solutions and approaches:
- base snaps
- we added code to dynamically install, as needed, bases or default-providers of content for a snap when it is being installed or refreshed
- we added more hooks and invoke them more liberally
- WIP interface hooks mean that connect, disconnect and auto-connect are now split over many tasks and hooks, and aren't covered by a single self-contained task anymore; it also means that operating on a snap can now trigger hooks on other snaps
These add the following requirements:
- at a high level, the base of a snap needs to be installed and active (current link set and stable) before the snap itself is installed
- the default-provider snaps of a snap should be present if possible and active before and during the auto-connect of that snap
- during the execution of a hook, the hook's own snap, its base, and the core snap (soon the snapd snap) must be active (snap-confine needs to consult and follow their current link); the same holds when starting services of a snap (and similar)
Ad hoc solutions and problems
ATM, for those features and to cover some of the requirements, we introduced the following ad hoc approaches, still with open problems:
- To install the base or default-providers for content of a snap, if not yet present, we add install tasks dynamically to the current change; once they are added, the current conflict logic prevents further changes from being started on those snaps. (The new tasks for multiple snaps need to be added in an all-or-nothing fashion, because adding them is equivalent to taking multiple locks under the conflict logic.)
- Given that the presence check is done from a task, not up front, concurrent operations could influence it, so we have logic that checks for pending operations that could affect the active state of the relevant snaps, and also for conflicts, when we try to add the install operations. In both cases the task-generating task simply retries (returns a Retry error) until it gets executed without interference. Notice here that if the conflict checks as currently written miss a relevant task kind, the outcome might be fragile and could flap, for example if an undo is triggered. (A sketch of this retry pattern follows the list.)
- Auto-connect has similar challenges, as it is now implemented as a task adding further required connect tasks and hooks and doing a best effort at making sure the relevant snaps are active. Differently from bases and default-providers, it is in general hard to know up front which snaps will be involved in an auto-connect.
- Hook execution at the moment doesn't do anything to make sure the relevant snaps are active. Usually the hook's own snap is guaranteed indirectly by the general conflict checks; this is not true though for the snaps on the other side of an auto-connect. Also, nothing is done about the base or core (soon the snapd snap) being active if they were already present before the change, or early in install/refresh changes. We have added chaining ensuring that for a multi-snap refresh core and bases are operated on before the dependent snaps, but this doesn't cover all scenarios.
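To make the retry pattern above concrete, here is a rough sketch of a task-generating task handler that backs off with a Retry error; otherInProgressOps and wouldConflict are hypothetical stand-ins for the real checks:

```go
package main

import (
	"time"

	"github.com/snapcore/snapd/overlord/state"
	"gopkg.in/tomb.v2"
)

// Hypothetical stand-ins for the real checks against pending
// operations and conflicts on the relevant snaps.
func otherInProgressOps(st *state.State, snapName string) bool { return false }
func wouldConflict(st *state.State, snapName string) bool      { return false }

// doCheckPrereqs sketches the retry pattern: if anything could
// interfere with the snaps we need, return a state.Retry error so the
// task runner reschedules us; otherwise it is safe to add install
// tasks for the missing snaps to the current change.
func doCheckPrereqs(t *state.Task, _ *tomb.Tomb) error {
	st := t.State()
	st.Lock()
	defer st.Unlock()

	for _, name := range []string{"some-base", "some-default-provider"} {
		if otherInProgressOps(st, name) || wouldConflict(st, name) {
			return &state.Retry{After: time.Minute}
		}
	}
	// ... safe here: add install tasks to t.Change() ...
	return nil
}
```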
We also have a higher-level modeling annoyance: we have first-class Change and Task, but changes have only informative kinds (at least that's how we have used them so far) and can bundle operations over many snaps, while tasks have small granularity and can be (and are) combined in different complex ways. We don't have a first-class entity corresponding to something like "installing" or "refreshing" a single snap; we have introduced lanes, but they are not first-class and mostly deal with abort logic. For example, there is no cheap/direct way to ask whether a lane is ready, or to know what a lane is accomplishing at a high level. (OTOH we also support putting the same task in many lanes, but that feature is AFAIK unused currently.)
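As a small illustration of the lane problem, asking whether a lane is ready today means something like this hypothetical helper, walking all the tasks of the change:

```go
package main

import (
	"github.com/snapcore/snapd/overlord/state"
)

// laneIsReady answers "is this lane of the change ready?" the only
// way currently possible: by iterating all tasks of the change and
// filtering by lane.
func laneIsReady(chg *state.Change, lane int) bool {
	for _, t := range chg.Tasks() {
		for _, l := range t.Lanes() {
			if l == lane && !t.Status().Ready() {
				return false
			}
		}
	}
	return true
}
```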
Ideas
Some possible directions for simplifying things, addressing the problems more robustly, and reducing the number of places in the code that need to care:
1. Simplify the conflict logic so it does not depend on task kinds anymore, but follows the spirit of at most one in-progress change explicitly touching any given snap. (See the sketch after this list.)
2. We should prevent removing (possibly also disabling, unless forced, though we don't have --force kind of flags anywhere yet) bases that have snaps using them.
3. We could move to an up-front approach in which the installation of bases and default-providers is added to an install or refresh change when it is created. Making this work would require extending the conflict checks to also cover the bases of the snaps operated on.
That alone would not be enough: for multi-snap operations we would need to use ordering (as we do for multi-snap refreshes already) so that bases and default-providers are operated on first. While bases cannot, default-providers could form chains or even loops, and that would need some care (topological sorting, breaking ties somehow).
This means though that while today two changes installing two different snaps that need the same not-yet-present base can be started simultaneously, in this approach that would give a conflict.
This does not solve the issue of auto-connect and of robustly running hooks for the other side of an auto-connection, though. So while it is something to consider, it's not a full solution.
4. We could teach specifically the hook runner task to wait on pending operations toggling the active state of the relevant snaps, and further also block the creation of more such operations (in-memory-only structures should be enough here, given that hook running is one task). The cost/benefit of this is unclear though. The most problematic hooks are the interface hooks for the implicit other side of the auto-connect case: if the operation we waited on was a removal, all we can do is fail, and it is already too late; we should not have scheduled the hooks and connect task to start with.
5. Broadly speaking we need to accept that running through the unlink=>setup-profiles=>link=>auto-connect=>other-ops-that-could-cause-undo sequence for a base snap cannot happen at the same time as we operate on a snap using that base. The same will soon hold for the snapd snap, and the same holds for a snap that could be involved in an auto-connection.
In a first approximation we could add code that stops such sequences (and similar ones that affect the current link of snaps) from happening concurrently, or concurrently with running hooks.
With current tools that could be done with a shared blocked predicate to be used with all task runners' SetBlocked. This predicate would need to look at all changes to make its decisions. Downloads would still be concurrent, because they happen before the unlink of the snap. Copying of snap data wouldn't, because it happens after.
To solve that and regain most concurrency, with a bit more complexity, we could consider two different sets of sequences of operations and the following CONSTRAINTS:
- at most one current-link affecting operation sequence on bases (soon also the snapd snap) and snaps with slots
- otherwise, many operations running hooks and current-link affecting operation sequences for application snaps with only plugs
Under the CONSTRAINTS the prerequisite snaps for hook running, auto-connect etc. would be active, and stably so.
Without going into details, the blocking predicate (see below for an initial sketch) would probably be somewhat expensive; to mitigate that we can use caching, and usually recheck the last observed situation (a set of tuples of roughly (snap, change, lane)) first.
As hinted before, some complexity in doing this comes from the fact that something like "installing"/"refreshing" a single snap doesn't have a first-class representation in the system right now. Lanes as we use them come the closest.
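As a hypothetical, simplified sketch of (1.): every change that operates on snaps would record the affected snap names, and the conflict check would consult only that, with no knowledge of task kinds:

```go
package main

import (
	"fmt"

	"github.com/snapcore/snapd/overlord/state"
)

// checkSnapConflict sketches idea (1.): each change that operates on
// snaps records the affected snap names under a hypothetical
// "snap-names" key, and the conflict check consults only that,
// without any knowledge of task kinds.
func checkSnapConflict(st *state.State, snapName string) error {
	for _, chg := range st.Changes() {
		if chg.Status().Ready() {
			continue
		}
		var names []string
		if err := chg.Get("snap-names", &names); err != nil {
			continue
		}
		for _, name := range names {
			if name == snapName {
				return fmt.Errorf("snap %q has %q change in progress", snapName, chg.Kind())
			}
		}
	}
	return nil
}
```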
Concretely:
- (1.) and (2.) are things we should, I think, definitely do
- (4.) was worth mentioning but the cost/benefit does not look attractive
- (3.) is worth thinking about
- (5.) cost/benefit is not completely ideal because of our current tools/modelling, but it is worth exploring and would solve most of our problems in a robust way, assuming a sane implementation and some guidelines. It would still need some, hopefully simpler, code to install bases and default-providers as needed, either up-front (3.) or by adding tasks dynamically
Sketch of what to consider to make task serialisation decisions
Define two sets of critical tasks:
- link-snap, unlink-snap, unlink-current-snap, setup-profiles, remove-profiles: because they affect the current link/active state and are cross-snap
- run-hook: because it requires active snaps (here we would need to consider also operations on services, but let's ignore them for the sketch)
For all non-ready changes find such tasks and consider the lane they are in (or the full change if there is no lane, aka lane 0); we get a list of critical lanes: tuples of roughly (snap, change, lane, a flag for whether the lane has tasks from the link-snap set).
For each of the critical lanes (ignoring some whitelisted tasks roughly corresponding to the prepare/download/validate part of an install or refresh), we have 3 possible states:
- ready: all tasks are ready
- pending: all tasks are in Do state
- in-progress: not all ready, not all in Do state
Given a task to run:
- if it doesn't belong to any critical lane or is part of the kind whitelist, it can run
- if there are in-progress critical lanes, run it only if it belongs to one already or if adding its lane wouldn't break the CONSTRAINTS
- if there are no in-progress critical lanes, any task can run
For implementing the CONSTRAINTS checks we also need a guess of whether a given snap is an application with only plugs. This information helps only to increase concurrency; the conservative assumption (that the snap is a base or has slots, or, just for simplicity, that it is a special type) is always correct.
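Finally, building on the criticalLane sketch above, a hypothetical shared blocking predicate applying these rules could look roughly like this; criticalLanes, laneOf, kindWhitelist and constraintsAllow are assumed helpers, and the plugs-only guess is the conservative one just described:

```go
package main

import (
	"github.com/snapcore/snapd/overlord/state"
	"github.com/snapcore/snapd/snap"
)

// Assumed helpers, standing in for real implementations:
// criticalLanes would walk all non-ready changes and return the
// critical lanes; laneOf would find the critical lane a task belongs
// to, if any; constraintsAllow would implement the CONSTRAINTS.
func criticalLanes(st *state.State) []*criticalLane                 { return nil }
func laneOf(t *state.Task, lanes []*criticalLane) *criticalLane     { return nil }
func constraintsAllow(lanes []*criticalLane, cl *criticalLane) bool { return false }

// kindWhitelist covers the prepare/download/validate part of an
// install or refresh (illustrative, not the real set).
var kindWhitelist = map[string]bool{"download-snap": true, "validate-snap": true}

// appWithOnlyPlugs guesses conservatively whether a snap is an
// application with only plugs; answering false is always safe, the
// information only helps increase concurrency.
func appWithOnlyPlugs(info *snap.Info) bool {
	return info.Type == snap.TypeApp && len(info.Slots) == 0
}

// blocked applies the decision rules sketched above; it has the
// shape expected by the task runners' SetBlocked.
func blocked(t *state.Task, running []*state.Task) bool {
	lanes := criticalLanes(t.State())
	cl := laneOf(t, lanes)
	if cl == nil || kindWhitelist[t.Kind()] {
		return false // not critical or whitelisted: can run
	}
	anyInProgress := false
	for _, other := range lanes {
		if other.state == laneInProgress {
			anyInProgress = true
			if other == cl {
				return false // already part of an in-progress lane
			}
		}
	}
	if !anyInProgress {
		return false // no in-progress critical lanes: any task can run
	}
	// otherwise run only if adding this lane keeps the CONSTRAINTS
	// satisfied (constraintsAllow would use appWithOnlyPlugs among
	// other things)
	return !constraintsAllow(lanes, cl)
}
```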