Cross-snap operations, bases, and concurrency

Operations on snaps are split over many tasks, which can be chained for a single snap and operation but overall run concurrently; OTOH we need to guarantee consistent state for snaps, both in memory and on disk.

A previous discussion and sharpening of tools relevant to that was here: Transactionality, locking and other concurrency coordination

So far the main mechanism to guarantee state consistency when splitting operations over many tasks has been to use conflict checks before creating the tasks and changes at all:

  • at most one in-progress change can touch a given snap at any time (as we have grown task kinds, and tasks can be contributed by many state managers, the checks have become somewhat needlessly complicated; the listing of relevant task kinds is fragile and may be wrong)
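
For illustration, here is a minimal sketch of the spirit of that rule without enumerating task kinds. This is not the actual snapstate code: the helper name is made up, and it assumes every relevant task carries snap-setup data reachable via snapstate.TaskSnapSetup and SnapSetup.Name().

    import (
        "fmt"

        "github.com/snapcore/snapd/overlord/snapstate"
        "github.com/snapcore/snapd/overlord/state"
    )

    // snapConflictCheck is a hypothetical, simplified conflict check: refuse to
    // start a new change touching snapName while any other non-ready change has
    // a task whose SnapSetup references that same snap.
    func snapConflictCheck(st *state.State, snapName string) error {
        for _, chg := range st.Changes() {
            if chg.Status().Ready() {
                continue
            }
            for _, t := range chg.Tasks() {
                snapsup, err := snapstate.TaskSnapSetup(t)
                if err != nil {
                    continue // task carries no snap-setup data
                }
                if snapsup.Name() == snapName {
                    return fmt.Errorf("snap %q has %q change in progress", snapName, chg.Kind())
                }
            }
        }
        return nil
    }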

Manipulating interface security profiles and connections has always been problematic from this point of view, because by its very nature it touches more than one snap's state at a time. The current/previous solution to that problem was to make sure at most one task operating on those is running at any time (ifacestate uses SetBlocked for this).
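
For reference, the shape of that mechanism is roughly the following (a sketch only: the classification helper is made up and the predicate ifacestate actually installs is more involved):

    import "github.com/snapcore/snapd/overlord/state"

    // manipulatesInterfaces is a hypothetical helper classifying task kinds.
    func manipulatesInterfaces(t *state.Task) bool {
        switch t.Kind() {
        case "connect", "disconnect", "setup-profiles", "remove-profiles":
            return true
        }
        return false
    }

    func setupInterfaceSerialization(runner *state.TaskRunner) {
        // Block an interface-manipulating task from starting while another
        // one is already running.
        runner.SetBlocked(func(t *state.Task, running []*state.Task) bool {
            if !manipulatesInterfaces(t) {
                return false // unrelated task, never blocked by this predicate
            }
            for _, other := range running {
                if manipulatesInterfaces(other) {
                    return true // wait until the other interface task is done
                }
            }
            return false
        })
    }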

Challenges from recent features

A series of recent features and enhancements has stressed these old solutions and approaches:

  • base snaps
  • we added code to dynamically install, as needed, bases or default-providers of content for a snap when it is being installed or refreshed
  • we added more hooks and invoke them more liberally
  • WIP interface hooks mean that connect, disconnect and auto-connect are now split over many tasks and hooks, and aren’t covered by a single self-contained task anymore; it further means that operating on a snap can now trigger hooks on other snaps

These add the following requirements:

  • at a high level, the base of a snap needs to be installed and active (current link set and stable) before the snap itself is installed
  • the default-provider snaps of a snap should be present if possible and active before and during the auto-connect of that snap
  • during the execution of a hook the hook’s own snap, its base, the core snap (soon the snapd snap) must be active (snap-confine needs to consult and follow their current link)
  • same for when starting services of a snap (and similar)

Ad hoc solutions and problems

ATM, for those features and to cover some of the requirements, we have introduced the following ad hoc approaches, still with open problems:

  • To install the base or default-providers for content of a snap, if not present yet, we add install tasks dynamically to the current change; after they are added, the current conflict logic prevents further changes from being started on those snaps. (The new tasks for multiple snaps need to be added in an all-or-nothing fashion because, due to the conflict logic, they are equivalent to taking multiple locks.)
  • Given that the presence check is done from a task, not up front, there can be concurrent operations that could influence it, so we have logic that checks for pending operations that can influence the active state of the relevant snaps, and also for conflicts when we try to add the install operations. In both cases the task-generating task just retries (returns a Retry error) until it gets executed without interference (see the sketch after this list). Notice here that if the conflict checks as they currently are miss a relevant task kind, the outcome might be fragile and flap, for example if an undo is triggered.
  • Auto-connect has similar challenges, as it is now implemented as a task adding the further required connect tasks and hooks, and doing a best effort at making sure the relevant snaps are active. Differently from bases and default-providers, it is in general hard to know up front which snaps will be involved in auto-connect.
  • Hook execution at the moment doesn’t do anything to make sure the relevant snaps are active. Usually the hook’s own snap is guaranteed indirectly by the general conflict checks; this is not true though for the snaps on the other side of an auto-connect. Also nothing is done about the base or core (soon the snapd snap) being active if they were already present before the change, or early in install/refresh changes. We have added chaining ensuring that for a multiple-snap refresh core and bases are operated on before the dependent snaps, but this doesn’t cover all scenarios.
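
The retry pattern mentioned above looks roughly like this (a sketch only: the handler name and the changeAffectsSnap helper are made up, the backoff is arbitrary, and the real prerequisites code checks more than one snap):

    import (
        "time"

        "gopkg.in/tomb.v2"

        "github.com/snapcore/snapd/overlord/state"
    )

    // doPrerequisitesSketch shows the shape of a task-generating task that
    // waits, by retrying, while other in-flight changes could still toggle the
    // active state of the snaps it cares about.
    func doPrerequisitesSketch(t *state.Task, _ *tomb.Tomb) error {
        st := t.State()
        st.Lock()
        defer st.Unlock()

        for _, chg := range st.Changes() {
            if chg.Status().Ready() || chg.ID() == t.Change().ID() {
                continue
            }
            // changeAffectsSnap is a hypothetical helper; "core" stands for
            // the base we want present.
            if changeAffectsSnap(chg, "core") {
                // Another in-flight change may toggle the base's active
                // state: ask to be retried instead of racing with it.
                return &state.Retry{After: 30 * time.Second}
            }
        }
        // ...now it is safe to check presence and inject install tasks...
        return nil
    }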

We also have a higher-level modelling annoyance: we have first-class Change and Task, but changes have only informative kinds (at least that’s how we have used them so far) and can bundle operations over many snaps, while tasks have small granularity and can be and are combined in different complex ways. We don’t have a first-class entity corresponding to something like “installing” or “refreshing” a single snap; we have introduced lanes, but they are not first-class and mostly deal with abort logic. For example, there is no cheap/direct way to ask whether a lane is ready or to know what the lane is accomplishing at a high level. (OTOH we also support putting the same task in many lanes, but that feature is AFAIK unused currently.)
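
To make the missing primitive concrete: an illustrative “is this lane ready?” helper (it does not exist today) would have to scan all the tasks of a change and filter by lane membership, roughly like this:

    import "github.com/snapcore/snapd/overlord/state"

    // laneReady reports whether all tasks of chg belonging to the given lane
    // are ready; tasks with no explicit lane are treated as lane 0.
    func laneReady(chg *state.Change, lane int) bool {
        for _, t := range chg.Tasks() {
            inLane := lane == 0 && len(t.Lanes()) == 0
            for _, l := range t.Lanes() {
                if l == lane {
                    inLane = true
                    break
                }
            }
            if inLane && !t.Status().Ready() {
                return false
            }
        }
        return true
    }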

Ideas

Some possible suggestions about directions for simplifying things, addressing the problems in a more robust way and reducing the places in code that need to care:

  1. Simplify the conflict logic so as not to depend on task kinds anymore, but to follow the spirit of at most one in-progress change explicitly touching any given snap.

  2. We should prevent removing (possibly also disabling, unless forced, though we don’t have --force-style flags anywhere yet) bases that have snaps using them.

  3. We could move to an up-front approach where the installation of bases and default-providers is added to an install or refresh change when it is created. Making this work would require extending the conflict checks to also cover the bases of the snaps operated on.

    That alone would not be enough: for multi-snap operations we would need to use ordering (as we do for multi-snap refreshes already) so that bases and default-providers are operated on first. While bases cannot, default-providers could form chains or even loops, and that would need some care (topological sorting, breaking ties somehow; see the ordering sketch after this list).

    This means though that while today two changes installing two different snaps that need the same not-yet-present base can be started simultaneously, in this approach that would give a conflict.

    The issue of auto-connect and of robustly running hooks for the other side of an auto-connection is not solved by this though. So while it is something to consider, it’s not a full solution.

  4. We could teach specifically the hook runner task to wait on pending active-state-toggling operations for the relevant snaps, and further to block the creation of more such operations (using in-memory-only structures, which should be enough given that hook running is a single task). The cost/benefit of this is unclear though. The most problematic hooks are the interface hooks for the other, implicit side in the auto-connect case; if the operation we waited on was a removal, all we can do is fail, and it is already too late: we should not have scheduled the hooks and connect task to start with.

  5. Broadly speaking we need to accept that running through the unlink=>setup-profiles=>link=>auto-connect=>other-ops-that-could-cause-undo sequence for a base snap cannot happen at the same time as we operate on a snap using that base. The same will soon apply to the snapd snap, and the same applies to a snap that could be involved in an auto-connection.

    As a first approximation we could add code that stops such sequences (and similar ones that affect the current link of snaps) from happening concurrently with each other or with running hooks.

    With current tools that could be done with some shared blocked predicate to be used with all task runners’ SetBlocked. This predicate would need to look at all changes to make its decisions.

    Downloads would still be concurrent because they happen before the unlink of the snap. Copying of snap data wouldn’t, because it happens after.

    To solve that and regain most of the concurrency, with a bit more complexity, we could consider two different sets of sequences of operations and the following CONSTRAINTS:

    • at most one current-link-affecting operation sequence on bases (soon also the snapd snap) and snaps with slots
    • otherwise, many concurrent operations: running hooks, and current-link-affecting operation sequences for application snaps with only plugs

    Under the CONSTRAINTS, the prerequisite snaps for hook running, auto-connect etc. would be active, and stably so.

    Without going into details, the blocking predicate (see below for an initial sketch) would probably be somewhat expensive here; to mitigate that we can use caching, and usually recheck the last observed situation (a set of tuples of roughly (snap, change, lane)) first.

    As hinted before, some complexity in doing this would come from the fact that something like “installing”/“refreshing” a single snap doesn’t currently have a first-class representation in the system. Lanes as we use them right now come the closest.
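
Back to (3.): the ordering needed there for bases and default-providers could be computed roughly like this (an illustrative sketch of Kahn’s algorithm over a hypothetical prereqs map from each snap in the change to its base/default-providers; loops are broken by arbitrarily picking a remaining snap):

    // orderForOperation returns snap names ordered so that bases and
    // default-providers come before the snaps depending on them. prereqs maps
    // a snap to its prerequisites (base, default-providers) within the change.
    func orderForOperation(snaps []string, prereqs map[string][]string) []string {
        pending := make(map[string]int, len(snaps)) // not-yet-ordered prerequisite count
        dependents := make(map[string][]string)
        for _, name := range snaps {
            pending[name] = 0
        }
        for _, name := range snaps {
            for _, p := range prereqs[name] {
                if _, ok := pending[p]; ok {
                    pending[name]++
                    dependents[p] = append(dependents[p], name)
                }
            }
        }
        var order []string
        for len(pending) > 0 {
            next := ""
            for name, n := range pending {
                if n == 0 && (next == "" || name < next) {
                    next = name // ready snap; break ties by name for determinism
                }
            }
            if next == "" {
                // a loop between default-providers: break it arbitrarily
                for name := range pending {
                    if next == "" || name < next {
                        next = name
                    }
                }
            }
            order = append(order, next)
            delete(pending, next)
            for _, d := range dependents[next] {
                if _, ok := pending[d]; ok {
                    pending[d]--
                }
            }
        }
        return order
    }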

Concretely:

  • (1.) and (2.) are things we should, I think, definitely do
  • (4.) was worth mentioning but the cost/benefit does not look attractive
  • (3.) is worth thinking about
  • (5.) cost/benefit is not completely ideal because of our current tools/modelling, but it is worth exploring and would solve most of our problems in a robust way, assuming a sane implementation and some guidelines. It would still need some hopefully simpler code, either up front (3.) or adding tasks dynamically, to solve installing bases and default-providers as needed

Sketch of what to consider to make task serialisation decisions

Define two sets of critical tasks (because they affect current link/active state, need active snaps, or are cross-snap):

  • link-snap, unlink-snap, unlink-current-snap, setup-profiles, remove-profiles: because they affect current link/active state and are cross-snap
  • run-hook: because it requires active snaps (here we need to consider also ops on services, but let’s ignore them for the sketch)

For all non-ready changes find such tasks and consider the lane they are in (or the full change if there is no lane, aka lane 0); we get a list of critical lanes (snap, change, lane, plus a flag for whether the lane has tasks from the link-snap set).

For each of the critical lanes (ignoring some whitelisted tasks roughly corresponding to the prepare/download/validate part of an install or refresh), we have 3 possible states:

  • ready: all tasks are ready
  • pending: all tasks are in Do state
  • in-progress: not all ready, not all Do

Given a task to run:

  • if it doesn’t belong to any critical lane or is part of the kind whitelist, it can run
  • if there are in-progress critical lanes, run it only if it belongs to one already or adding its lane wouldn’t break the CONSTRAINTS
  • if there are no in-progress critical lanes any task can run

For implementing the CONSTRAINTS’ checks we also need a guess at whether a given snap is an application with only plugs. This information only helps to increase concurrency; the conservative assumption (that the snap is a base or has slots, or, just for simplicity, is a special type) is always correct.
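
A very rough Go shape of the decision, under one plausible reading of the CONSTRAINTS (all names are made up, the computation of the critical lanes themselves and the caching are left out, and the heavy flag encodes the conservative plugs-only guess from the previous paragraph):

    // criticalLane is the (snap, change, lane) tuple from the sketch; heavy
    // marks lanes containing link-snap-set tasks for a base, the snapd snap
    // or a snap with slots (i.e. not a plugs-only application).
    type criticalLane struct {
        snapName string
        changeID string
        lane     int
        heavy    bool
    }

    // mayStart decides whether a critical lane that is not yet in progress
    // can acquire its first running task, given the lanes already in
    // progress: a heavy lane runs alone; light lanes (hook running, link
    // sequences on plugs-only apps) run together as long as no heavy lane is
    // in progress.
    func mayStart(candidate criticalLane, inProgress []criticalLane) bool {
        if candidate.heavy {
            return len(inProgress) == 0
        }
        for _, l := range inProgress {
            if l.heavy {
                return false
            }
        }
        return true
    }

    // A shared blocked predicate would then let a task run if it is
    // whitelisted, if it belongs to an in-progress critical lane already, or
    // if mayStart returns true for its lane.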


Thanks for writing that down, Samuele. Great analysis.

How would that work when there are dynamically injected tasks on the change? Perhaps similar to how we do it today (bail and retry until it’s safe)?

The overall simplicity of that approach is appealing. We need to evaluate whether this would change the user experience in any meaningful way.

Indeed we can’t remove them. Disable feels minor, though… do we have any logic using bases that depends on the base being active?

That would mean the base would stay in even if the snap installation fails, for example, which seems awkward.

Those feel somewhat complex, and thus somewhat prone to further issues. We should start from a situation of stability that we can very easily reason about.

Fully on board.

Yes, just with this change we still need the wait code (only (5.) avoids some of that, because it introduces invariants where we know which snaps can or cannot have their current link unset at any given point while other ops are happening).

I don’t think it makes a big difference for the user. It would mostly reimplement what we have already, just with a few more conflicts. The effect is larger if you do (3.), because of the up-front conflicts for bases etc.

As I wrote originally, for example any hook running for a snap with a base needs the base to be active (current link set).

This actually already happens because of lane usage though (the base installation is in a different lane than the original snap, I think, even more so if the base is being installed in a different change).

I agree to some extent; I would expect (5.) to be a bit simpler in practice than it sounds as described here, especially if we can assume at most one lane per task, which is actually what we do already AFAICT. OTOH I have no clear solution other than something like (5.) for the issue between bases and hooks, for example, because as I explained originally waiting in run-hook itself is a bit too late in some cases; also waiting is fragile against undo unless we make it more complicated and similar to (5.), but all over the place.

Either way I think we need to start cleaning up things in the (1.) direction, and then explore in code (5.) and answers to (3.).

To recapitulate, at the moment we are deploying the following:

  • general conflict checks between changes based on task kinds (this is fragile and has extensibility issues as we are seeing with the new snapshotstate work)
  • blocked predicates for tasks on various TaskRunners (snapstate, ifacestate, hookstate)
  • complex task injection, checks for pending tasks for the relevant snaps, and waiting (by Retry) in the prerequisites task. Those checks, as they are, are fragile against undos.
  • complex task injection, checks for pending tasks for the relevant snaps, and waiting (by Retry) in the auto-connect task (and similar other ifacestate tasks). Same issues with undos.

OTOH we still don’t have a solution to ensure for run-hook tasks that the relevant snap, base, and snapd snap are active.

This is quite a complex picture and arsenal of tools to reason about.

It is worth recalling that cross-snap issues arise in general between a snap and its base, the bootable base, the new snapd snap, and also other snaps providing slots for its plugs.

I think two possible consistent approaches to simplify things would be:

  • blunt conflicts (a combination of (1.) + (3.) plus the remark about the source of cross-snap issues): make any snap change conflict with changes on the snap itself, on any base, on snapd, or on any snap with slots. Add installing missing bases and default-providers up front to changes. For multi-snap changes then use ordering/dependencies to solve the intra-change issues, as we do in UpdateMany. This would still need the relatively simple blocked predicates we have now, but that’s ok.

  • If blunt conflicts are too annoying, i.e. provide insufficient concurrency: pursue (5.). One interesting insight there is that once we have just one TaskRunner (something that we need/are planning anyway) we can compute and pass to the blocked-task predicate a list of in-flight lanes as well, which would make a good base for (5.) logic. For clarity, in this world the complexity would move to a general blocked-task predicate that would classify and decide which lanes can run together and which can’t. The general conflict check would care only about conflicts between a snap and itself across changes. prerequisites would stay but be a bit simpler. auto-connect would still inject tasks but should be able to be relatively naive otherwise.

For both plans we need to keep track of the snap type and whether it has only plugs and no slots, at least for changes that link or unlink the snap. We are doing the former already very directly in some cases (set in SnapSetup); we just need to extend that.
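
As an illustration only (the field names are made up; SnapSetup already carries the snap type in some cases), the extra bookkeeping could be as small as:

    import "github.com/snapcore/snapd/snap"

    // Illustrative shape of the extra data to record at change-creation time,
    // while the snap.Info is still at hand, for the blocked predicate to use
    // later when the snap may be unlinked.
    type snapSetupExtra struct {
        Type      snap.Type `json:"type,omitempty"`       // app, base, os, ...
        PlugsOnly bool      `json:"plugs-only,omitempty"` // app with no slots: safe for more concurrency
    }

    // Filled in roughly as: PlugsOnly = len(info.Slots) == 0 && the type is app.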


First step of streamlining conflict detection:

This also makes the logic pluggable through snapstate.AddAffectedSnapsByAttr and snapstate.AddAffectedSnapsByKind.
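
For another state manager this looks roughly like the following (a sketch; the attribute name is just an example inspired by the snapshot work, and the exact AffectedSnapsFunc signature is from memory):

    import (
        "github.com/snapcore/snapd/overlord/snapstate"
        "github.com/snapcore/snapd/overlord/state"
    )

    func init() {
        // Tasks carrying a "snapshot-setup" attribute are considered by the
        // generic conflict logic; the function says which snaps they affect.
        snapstate.AddAffectedSnapsByAttr("snapshot-setup", func(t *state.Task) ([]string, error) {
            var setup struct {
                Snap string `json:"snap"`
            }
            if err := t.Get("snapshot-setup", &setup); err != nil {
                return nil, err
            }
            return []string{setup.Snap}, nil
        })
    }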

@robert.ancell hi, #5502 added a specific new error kind for conflicts between operations (which means the op could be retried later): snap-change-conflict; likely something for snapd-glib to support.
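
Using the Go client for illustration (the exact client API details here are from memory and may differ; snapd-glib would expose the same error kind), a caller could retry on that kind:

    import (
        "time"

        "github.com/snapcore/snapd/client"
    )

    // installWithRetry retries an install while snapd reports that it
    // conflicts with another in-progress change.
    func installWithRetry(cli *client.Client, name string) (changeID string, err error) {
        for attempt := 0; attempt < 10; attempt++ {
            changeID, err = cli.Install(name, nil)
            if err == nil {
                return changeID, nil
            }
            cerr, ok := err.(*client.Error)
            if !ok || cerr.Kind != "snap-change-conflict" {
                return "", err // not a conflict: give up immediately
            }
            time.Sleep(2 * time.Second) // another change is in flight, retry
        }
        return "", err
    }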

Hi. Even though it has been a long time since the last activity in this thread, I feel my issue is related to this discussion. I have a CI system that, for the sake of testing Snap packages, will try to install them on the designated machines and then run the automatic tests that are provided. There is a high chance that two Snaps will end up on the same machine at the same time, due to the parallelization involved. As a result, there can be conflicts in snapd when connecting the plugs or installing/removing the Snaps. The scripts then return with conflict errors and we end up in a stable yet undesirable state where most likely nothing has been done.

Currently, as a workaround, I have wrapped the parts of the script related to installation, removal, and plug connection in retries with random sleep periods, yet I feel this can’t be the RIGHT way. Since the operations of interest have no semantic conflict (i.e. they relate to completely independent Snaps; the only conflict is on snapd itself, which has the slots for the plugs or must perform operations during installation or removal), if snapd showed a blocking behaviour in which the request stayed pending until it either succeeded or failed for reasons other than a conflict (or maybe a timeout), then I wouldn’t need to take care of the retries anymore and my pipelines would be simpler, faster, and more efficient.

Is there any way to make snapd behave like this, or are there better ways to handle concurrency at this level? I would be happy to go into more detail if you’re interested.