Systemctl service management unification

ijohnson · October 21, 2019, 8:55pm

FYI this is a development post about snapd internals and isn’t probably interesting if you don’t work on snapd itself…

Problem statement

Currently, we have at least 3 situations where snapd code will interact with systemd to control services/daemons.

When installing/removing snaps and we have to standup the systemd units and start services for the first time (and stop them when it comes time to update).
When a snap user is managing services and runs something like snap stop.
When a snap itself (i.e. the snap author writes this) wants to manage it’s own service and runs something like snapctl stop.

The big problem is that while the codepath for 2 and 3 is pretty much the same (they both end up back in servicestate overlord and create service-control changes with exec-command tasks), the codepath for 3 is completely different and goes straight to the wrappers/systemd package without creating any sub-tasks, etc. This means we have tasks like stop-snap-services in the snapstate manager which operates in a completely different manner from service-control in the servicestate manager w/respect to how things are told to stop.

There is a desire to unify the implementations to make them follow mostly the same codepath, at least when it comes to working with systemd using the systemctl command because of various wrinkles in working with systemd. Exacerbating this situation is the disparity in how the 2 methods are tested, for one we have a systemctl function which is run in Go and needs to handle the args passed to it via Go function mocking, while for other tests we have to mock systemctl as a shell function on the system because running that binary is hard-coded inside wrappers/systemd package.

Status Quo

Externally, mechanisms 2 and 3 will create a service-control change with exec-command tasks with the systemctl binary + args just passed in literally to the exec-command task. This is unfortunate because we already have many high level functions in the systemd package which handle this in a more high level and are more consistently mocked, etc. Mechanism 1 however has a few different changes (i.e. install-name, refresh-name, etc.) that create stop-snap-services or start-snap-services tasks which call into the aforementioned systemd package. These tasks are fine right now AFAICT.

Deep dive into the codepaths

This can be skipped if familiar with the code, but is here for posterity

Specifically for 1, in cmd_snap_op.go, cmdInstall.installOne will call into client.Install, which eventually posts to /v2/snaps/<name> on /run/snapd.socket. From there the daemon picks up in daemon/api.go with postSnap and eventually will get to snapstate.Install, which after running around a bit will call stopSnapServices which itself will go call backend.StopServices, … which snakes down into wrappers.StopServices and finally builds up the systemctl args using wrappers/systemd.go

Specifically for 2, in cmd_services.go, svcStop.Execute will call into cli.Stop, which posts to /v2/apps on /run/snapd.socket. From there the daemon picks up in daemon/api.go with postApps and will work next into servicestate.Control which finally will build up the systemctl args and run them with cmdstate.ExecWithTimeout.

Specifically for 3, in cmd/snapctl/main.go, run() will call into cli.Runsnapctl which posts to /v2/snapctl on /run/snapd-snap.socket. From there the daemon picks up in daemon/api.go with runSnapctl and will work next into ctlcmdRun, which then winds back into stopCommand.Execute from ctlcmd package, and finally ends up at the same place with servicestate.Control and then follows 2 from this point.

Proposals

I propose the following steps to alleviate the situation:

service-control changes no longer create exec-command tasks and instead create start-snap-services and/or stop-snap-services tasks (possibly with enable a sub-case of start-snap-services and disable a sub-case of stop-snap-services)
start-snap-services and stop-snap-services start tracking internally to the task what services are to be started/stopped, etc. (implementation detail: if this isn’t specified in the task, then do the right thing and assume “all” with the current behavior so that we can leave as much as possible in snapstate_test.go well enough alone )
Refactor relevant tests to use “the one way” to mock systemctl as needed (probably the mock shell script way I think)

Another alternative which would probably be more work would be to:

Remove the start-snap-services and stop-snap-services tasks altogether, replacing them with a (new) kind of task called service-manage or something specific to managing services
Refactor the service-control change to use this new service-manage task

There may indeed be more alternatives, but this is what I’ve come up with.

ijohnson · October 21, 2019, 8:55pm

CC @pedronis and @mvo

pedronis · October 22, 2019, 3:26pm

given an optional “services” parameter attached to those tasks (missing/empty meaning all) and go from there seems fine to me.

ijohnson · October 22, 2019, 6:32pm

To be clear, this is exactly what I was referring to

chipaca · October 22, 2019, 8:36pm

Your main proposal seems sane. The only issue I see is that of how state would behave on revert. Thoughts on how you’d handle that?

ijohnson · October 23, 2019, 12:14am

Hmm, regarding revert I will note that the only things in state that would be problematic is:

service-control tasks might exist in state with “Doing” when we refresh to the new version
{start,stop}-snap-services tasks might exist in state with data that isn’t understood when we revert back to an old version (and also these tasks might be in unexpected places but it seems this would be handled fine)

Which leaves us with the following situations:

If we are refreshed to the new version of snapd during a change like installing a snap, we would only be in trouble for service-control changes which were in flight, which could be triggered by snapctl usage during a hook run. IIUC, to handle this situation we would need a patch which during snapd startup would change these service-control tasks to start-snap-services or stop-snap-services tasks depending on the systemctl args in data.argv in the task (also inserting into the new tasks the name of the service, again deduced from the data.argv in the old task)
If we are refreshed to the new version of snapd during a change like service-control that’s not tied to a larger change like installing a snap, we would still do the same as 1 AFAICT and do a patch to state upgrading the service-control tasks to {start,stop}-snap-services tasks
If we successfully refresh to a new snapd and then start a new change like installing a snap and then during that change snapd gets reverted to the previous version, we would have a problem because now we just have these {start,stop}-snap-services tasks in seemingly odd places but the larger issue is that snapd would do the wrong thing and implement the current behavior for these tasks and operate on all services in a snap, when in fact that task in the new snapd could have just been for a single service (or it could have been to disable/enable and now it just gets started/stopped and not enabled/disabled).
If we successfully refresh to a new snapd and then start a new change like stop-snap-services from something like snap stop name.svc and then during that change snapd gets reverted to the previous version, we would have the same problem as above where all of the services would be started to stopped instead of a specific one.

While 3, and 4 are unfortunate (and indeed wrong), I don’t know what we could do about this, and I would say that in the very least these bugs would only affect snaps and their services, and not affect snapd itself being unable to operate. One way to solve this would be to first patch {start,stop}-snap-services tasks to understand the task data about which services to operate on and do the right thing, and then at least if we got reverted to this version the right thing would happen. However I’m not sure how we can ensure that we revert to that version because for example snapd could refresh from an even older version to the newest one and wouldn’t be able to first go to this middle, overlapping version. If we had epochs for snapd itself, that would be able to solve this problem, but AFAIK I don’t think we have that.

Another option to fix that would be if we have the inverse of patch (i.e. something that runs with the new snapd before we go back to the old snapd) we could patch the new tasks to the equivalent old ones. I don’t know such a mechanism exists beyond snap-repair which seems to be as minimally invasive as possible to just get snapd to start again. I’m not sure if this bug warrants expanding the scope of that tool. Also that tool currently only runs on UC18, so again either the scope of that tool would have to expand to work elsewhere.

Yet another option would be to remove stop-snap-services and service-control tasks replacing both of them with some new named task which operates like {stop,start}-snap-services does today with the modifications from my proposal, and then when we got reverted, snapd would at least not do the wrong thing I explained above for these tasks, it would instead just do nothing (which may be more wrong I don’t know). I personally don’t like this specific idea though

ijohnson · November 27, 2019, 3:02pm

After some discussion about options here we decided to do the following:

Add a new task to the “service-control” change, call it service-control for now, that does the right thing
Leave the old task exec-command in the change, but add new “ignore” flag (name TBD) so that if the new snapd sees this task it does nothing, but if the old snapd sees this task it does the old right thing

Furthermore, the new service-control task will have some parameters to it that need to be design a little bit:

it will have the name(s) of services as a parameter for the task
it will have the action as a string to do for all of the services in the parameter of the task (i.e. disable all these services or stop all these services, etc.)
actions like disabling just a service, but not the socket or disable just the socket, but not the service will be enabled by somehow setting the name of the service specially. For example if we have a service svc1 with a socket, with the socket named sock1 and we want to disable both we could have the services list in the task like ["svc1"] and if we wanted just the socket we would do something like ["svc1.sock1"] and if we wanted just the service we would do something like ["svc1.service"]. The specific mechanism may change, but this is the idea

Finally, we agreed that $SOME_DAY, when snapd can do epochs for itself, we will move snapd up to the next epoch and drop the old task, such that if going to the new snapd fails, we can rollback state somehow. This is far in the future, so this is more just a comment on what we would like to do eventually.