In the managers and overlord we use the following tools to make sure operations don’t interfere with each other or produce broken state:
locking: State.Lock and State.Unlock where the latter also serializes the state to disk, so that it can be picked up across restarts, crashes etc and operation can continue
when an operation spans multiple tasks we use conflicts and checks for pre-existing pending/running tasks/changes to make sure for example we don’t have concurrent operations on the same snap (see e.g. snapstate.CheckChangeConflict and devicemgr.changeInFlight)
we also use TaskRunner.AddBlocked to add predicate functions to the single shared TaskRunner to control the serialisation of tasks, for example ifacestate conservatively serialises all of its tasks because they can touch multiple snaps
Generally we prefer correctness over concurrence, so in general tasks that are mutating global shared parts of the state should take the State lock once. Exceptions that release the lock around possibly slow operations should avoid touching shared global bits at all if possible, or use serialisation otherwise carefully and sparingly.
The snapstate.CheckChangeConflict is mechanism is pluggable through snapstate.AddAffectedSnapsByAttr (preferably) and snapstate.AddAffectedSnapsByKind.
ATM as discussed with @niemeyer snapstate especially doesn’t follow “Generally we prefer correctness over concurrence” but has quite a few tasks that proceed updating the SnapState of a snap across unlocking and relocking the state, just trusting the conflict mechanism to protect this, which is something that it is a bit too implicit to trust going forward. We decided we want to fix that and I will prepare a PR making those tasks take the lock only once. If we hit real bottlenecks we will revisit this possibly using serialization instead as discussed.
@chipaca There’s probably some misunderstanding about what is being proposed above. The proposal for the time being is simply to lock a bit differently, including some relatively fast disk operations that we used to unlock on. That’s what is meant by prioritizing simplicity (implying easier correctness) rather than performance on those cases.
but if a task Get and then Set the SnapState of a snap it must avoid releasing the state lock in between, other tasks might have reasons to update the SnapState independently:
// DO NOT DO THIS!:
snapst := ...
snapst.Attr = ...
st.Unlock()
...
st.Lock()
Set(st, snapName, snapst)
if a task really needs to mix mutating a SnapState and releasing the state lock it should be serialized at the task runner level, see SnapManger.blockedTask and TaskRunner.SetBlocked
@niemeyer st.RequestRestart(state.RestartDaemon) is just daemon tomb.Kill(nil) through a bunch of abstraction indirections, is not waiting for the restart at that point . The system restart case is os.Execing shutdown from the call path but we want to change that, it’s also an async request though, not waiting for the actual shutdown.
Over this now merged PRs we have switched to a model where there is a single TaskRunner created by Overlord and shared/used by all managers:
Main motivation is that the plans to fix Issue with using snapstate Active for interface repository will require sharing more responsibilities across some SnapManager and previously InterfaceManager tasks (collapsing setup/remove-profiles and *link-*snap), in such a way that atomicity enforced through blocked predicates will need to span over tasks from both, which is very awkward unless the tasks all run under one single TaskRunner.
Blocked predicates are now cumulatively added by managers to this single TaskRunner via AddBlocked. The running of a task is naturally blocked if any of these predicates returns true.