Restarting services from configure hook: race condition

I ran into an interesting issue tonight, and was curious if this is something we’ve considered.

I’m using the configure hook for a port number on which my snap listens. It defaults to 80, but one can change it with e.g. snap set mysnap port=81.

The piece of software listening on port 80 must be restarted in order for that new configuration to take effect. Since snaps don’t have access to systemctl, my solution has been to set services to restart-condition: always and then simply kill the daemon itself, relying on systemd to bring it back up.

The problem is, doing this in the configure hook causes a race condition. When the daemon comes back up, it determines the port it’s supposed to use using snapctl get port. But if systemd happens to bring it back up before the configure hook has exited successfully, since the configuration change happens in a transaction, the change hasn’t actually taken place yet, and the daemon comes back up with the old port.
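To make the race concrete, here is a toy model of a transactional config store. It is only a sketch and not how snapd is implemented; the `ToyConfig` class and its `in_hook` flag are hypothetical names used for illustration.

```python
class ToyConfig:
    """Toy model of transactional config; not snapd's actual implementation."""

    def __init__(self, committed):
        self.committed = dict(committed)  # what processes outside the hook see
        self.pending = {}                 # changes visible only inside the hook

    def set(self, key, value):
        # A pending change, as made by "snap set" before the hook has finished.
        self.pending[key] = value

    def get(self, key, in_hook=False):
        # Only a caller inside the hook context sees uncommitted values.
        if in_hook and key in self.pending:
            return self.pending[key]
        return self.committed.get(key)

    def commit(self):
        # Happens only once the configure hook has exited successfully.
        self.committed.update(self.pending)
        self.pending.clear()


config = ToyConfig({"port": 80})
config.set("port", 81)

seen_by_hook = config.get("port", in_hook=True)  # the hook sees the new value
seen_by_early_daemon = config.get("port")        # restarted before the commit
config.commit()
seen_by_late_daemon = config.get("port")         # restarted after the commit
```

In this model, a daemon that systemd brings back before `commit()` reads the old port, which matches the behaviour described above.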

How should I be restarting services in the configure hook to account for this? As far as I’m aware, even the ability to restart services via snapctl won’t solve this issue. Thoughts? The solution I’m currently working on is implementing a lock that will be checked by my services, but that seems a bit heavy.

Taking a cue from Ansible’s handlers, it could be slick to queue up service restarts and only run them after the hook has completed. I suppose that would be implemented in snapctl as part of the services work.

We do similar tricks with sending signals to PIDs, in order to work around the current lack of service control inside apps/hooks.

However, we have avoided this problem by not having the daemon startup use snapctl get, and instead just read a config file that is written by the configure hook, so you control the ‘transaction’.
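A minimal sketch of that approach, assuming the configure hook is free to pick its own file format and location: the hook writes the whole config to a temporary file and renames it into place, so the daemon reads either the old file or the new one, never a partial write. The function names and the JSON format are assumptions made for illustration, not something prescribed by snapd.

```python
import json
import os
import tempfile


def write_config_atomically(path, settings):
    """Write settings as JSON so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".config-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(settings, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        os.unlink(tmp_path)
        raise


def read_config(path):
    """What the daemon wrapper would call at startup."""
    with open(path) as f:
        return json.load(f)
```

The hook would call `write_config_atomically` just before signalling the daemon, so the file is already in place by the time the daemon comes back up.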

Not sure if that will help here, but in general, I’m a bit wary of relying on snapctl get as part of your services’ start up code.

That being said, queuing service restarts might be nice.

I started out doing the same thing in the pre-2.27 days, when you couldn’t use snapctl outside of hooks. You definitely have some advantages with that approach:

  1. You don’t have to deal with this race condition
  2. You don’t have to deal with the fact that there’s no way to determine which property changed in the configure hook, since you saved them in a file and can compare them. I actually save two values every time I set one with the configure hook: the actual value, and a previous value that I can compare against later :stuck_out_tongue:
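The "save the previous value and compare" trick can be sketched as follows. The snapshots are plain dictionaries here; where they are stored (a file, or a second snap config key) is left open, and the `changed_keys` helper is a hypothetical name.

```python
_MISSING = object()  # distinguishes "key absent" from "key set to None"


def changed_keys(previous, current):
    """Return the sorted keys whose values differ between two config snapshots."""
    keys = set(previous) | set(current)
    return sorted(
        k for k in keys
        if previous.get(k, _MISSING) != current.get(k, _MISSING)
    )


previous = {"port": "80", "log-level": "info"}
current = {"port": "81", "log-level": "info"}

# Only restart the daemon when a setting it cares about actually changed.
needs_restart = "port" in changed_keys(previous, current)
```

After acting on the difference, the hook would store `current` as the new "previous" snapshot for the next run.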

But I really like the idea of using snapd to save my config state, and I feel like that’s the “snappy way” to do this; it’s just got a few rough edges. Conceptually, I’m curious why you’re wary of relying on snapctl get in your services? Note that the real services themselves (apache, etc.) certainly do not shell out to snapctl; I just set environment variables in wrappers, which are then used by the services.

So I think my wariness in part comes from having a service that depends on some other non-trivial process in order to start my service correctly, which is the essence of the problem here.

I also have a general preference for config files over environment variables, for a few reasons:

  1. they are one place to go to view a service’s config.
  2. they can be self-documenting
  3. they can be statically verified before restarts, and the restart can be aborted if they are not correct. An invalid env var may not be detected until it’s already brought your service down.
  4. they allow for the application to support graceful restarts by reloading from that config file[1].
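As an illustration of point 3, a hook could validate a value before writing the file or touching the service. This is a minimal sketch for the port setting from the original question; `validate_port` is a hypothetical helper, and the accepted range is the usual TCP port range.

```python
def validate_port(raw):
    """Return the port as an int, or raise ValueError if it is not usable."""
    try:
        port = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"port must be an integer, got {raw!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be between 1 and 65535, got {port}")
    return port
```

If validation raises, the hook can exit non-zero before any restart happens, so the running service keeps its old, working configuration.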

Env vars have their use for sure, but my default is to prefer config files for applications.

Additionally, I am unsure about snap config being the place for all config. My current thinking is that snap config should probably be for high level user config (ideally explicit, typed, and documented, like juju). Any concrete or derived config should probably be in files in $SNAP_DATA.

Thanks

[1] I’m aware that, in this case, changing ports usually requires a full restart anyway :slight_smile:

– Simon

I’ve discussed some aspects of this with @kyrofa on IRC. This is indeed a problem, as the service that’s being restarted from the configure hook doesn’t see the new values from snapctl get ..., due to the transactional nature of the configure hook.

Here is what has been discussed as possible solutions:

  1. A way for the configure hook to queue commands that restart the service after the configure hook succeeds, e.g. snapctl queue -- snapctl restart apache.
  2. A new post-configure hook executed after the configure hook succeeds. It could implement logic for restarting services. This is problematic though, as it would need to know which config values have changed and what needs restarting based on these values, which is cumbersome.
  3. Discussed briefly, clearly hypothetical and probably problematic if possible at all (mentioning just for the record): see if it’s possible for the restarted service to “see” the cookie/context env var of the configure hook transaction, so calls to snapctl get ... done by the service see new uncommitted values. As indicated this has issues, e.g. it breaks the transactional nature of setting a configuration (the service may be running with a configuration that failed to apply), and it eventually leaves the service with a context/cookie which is supposed to be ephemeral (the transaction gets committed eventually, and the context is gone).
  4. A way for snapctl to explicitly commit config changes when asked to, e.g. snapctl commit, so that the configure hook can control this aspect and then restart services with the new changes in place. This feels a bit hacky though: configure is no longer fully transactional, and complexity grows.

I hope that captures the points we discussed. Thoughts?

I think we should default to only changing services if the configure hook actually succeeds. We can easily do that by inserting the service change task in the same Change as the currently running hook, and making it wait for the hook itself. The snapctl command then returns immediately instead of waiting for the change, of course.

Later we can introduce a flag that undoes that, and forces the restart to happen immediately. But that’s the more dangerous and less sane behavior for a configure hook.

Let’s please try to do that in a general way instead of specializing the configure hook, and then only enable the behavior for the configure hook itself for now.

Ah, we support adding tasks to a change that’s in progress? That sounds like a good fit indeed.

A few clarifying questions:

  1. This of course depends on the snapctl service restart support. I think that’s already landed, correct?
  2. Would it make sense to only queue service restarts uniquely? E.g. if two configuration changes occurred that require a service restart, can we make sure that we only restart it once at the end, even if snapctl restart apache was called twice? That might simplify logic in the hook, and I don’t see an obvious downside.

Yeah, unifying those is a good idea. Should be easy to do.
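The deferred, de-duplicated restart idea discussed above can be sketched as a toy model. This is not snapd's task/Change machinery; `HookRun`, `request_restart` and `finish` are hypothetical names, and "mysql" is just a second example service.

```python
class HookRun:
    """Toy model: restarts requested during a hook run only after it succeeds."""

    def __init__(self, restart_service):
        self._restart_service = restart_service  # callable doing the real restart
        self._queued = []                        # unique services, in request order

    def request_restart(self, service):
        # Returns immediately; repeated requests for one service collapse to one.
        if service not in self._queued:
            self._queued.append(service)

    def finish(self, succeeded):
        # Called when the hook exits. A failed hook drops all queued restarts.
        queued, self._queued = self._queued, []
        if not succeeded:
            return []
        for service in queued:
            self._restart_service(service)
        return queued


restarted = []
run = HookRun(restarted.append)
run.request_restart("apache")
run.request_restart("apache")  # duplicate request, restarted only once
run.request_restart("mysql")
during_hook = list(restarted)  # nothing has been restarted yet
run.finish(succeeded=True)
```

Because the restarts only run in `finish`, the services come back up after the new configuration has been committed, and a failing hook leaves them untouched.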

Thanks @niemeyer, nice idea! I’ll make it happen.

@pstolowski, @niemeyer, thank you both. That will be lovely.

@pstolowski happy to review when you’ve got something.

PR https://github.com/snapcore/snapd/pull/4070

It seems I am hitting a case where this race condition is still present.

One of the services I am queuing for a restart (flanneld) has to have access to a few files to start correctly (e.g. /var/snap/microk8s/873/args/flanneld). The configure hook places these files in the right locations and queues the restart (flanneld has to wait for etcd and containerd: https://github.com/ubuntu/microk8s/pull/649/files#diff-8f4e953fdcce135fef1df9e88717ab5dR288).

If you look at the logs below, you will see that the first time flanneld starts, it uses the right path. If it fails and systemd restarts it, it switches to the not-yet-committed path, where files like /var/snap/microk8s/873/args/flanneld do not exist yet.

Sep 16 15:22:36 ip-172-31-20-243 systemd[1]: Started Service for snap application microk8s.daemon-flanneld.
Sep 16 15:22:39 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: Error:  100: Key not found (/coreos.com) [4]
Sep 16 15:22:39 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: /coreos.com/network/config is not in etcd. Probably a first time run.
Sep 16 15:22:39 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: {"Network": "10.1.0.0/16", "Backend": {"Type": "vxlan"}}
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.327351    2093 main.go:514] Determining IP address of default interface
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.327682    2093 main.go:527] Using interface with name eth0 and address 172.31.20.243
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.327696    2093 main.go:544] Defaulting external address to interface address (172.31.20.243)
Sep 16 15:22:42 ip-172-31-20-243 flanneld[2093]: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.349349    2093 main.go:244] Created subnet manager: Etcd Local Manager with Previous Subnet: None
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.349473    2093 main.go:247] Installing signal handlers
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.360704    2093 main.go:386] Found network config - Backend type: vxlan
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.362613    2093 vxlan.go:120] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.374094    2093 local_manager.go:234] Picking subnet in range 10.1.1.0 ... 10.1.255.0
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.375058    2093 local_manager.go:220] Allocated lease (10.1.75.0/24) to current node (172.31.20.243)
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.378865    2093 main.go:317] Wrote subnet file to /var/snap/microk8s/common/run/flannel/subnet.env
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.378877    2093 main.go:321] Running backend.
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.378915    2093 vxlan_network.go:60] watching for new subnet leases
Sep 16 15:22:42 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:22:42.385354    2093 main.go:429] Waiting for 22h59m59.989196432s to renew lease
Sep 16 15:22:50 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: E0916 15:22:50.823603    2093 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: unexpected EOF
Sep 16 15:22:50 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: E0916 15:22:50.823632    2093 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: unexpected EOF
Sep 16 15:23:29 ip-172-31-20-243 systemd[1]: Stopping Service for snap application microk8s.daemon-flanneld...
Sep 16 15:23:29 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:23:29.865057    2093 main.go:370] shutdownHandler sent cancel signal...
Sep 16 15:23:29 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:23:29.865112    2093 main.go:437] Stopped monitoring lease
Sep 16 15:23:29 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:23:29.865122    2093 main.go:339] Waiting for all goroutines to exit
Sep 16 15:23:29 ip-172-31-20-243 microk8s.daemon-flanneld[2093]: I0916 15:23:29.865565    2093 main.go:342] Exiting cleanly...
Sep 16 15:23:29 ip-172-31-20-243 systemd[1]: Stopped Service for snap application microk8s.daemon-flanneld.
Sep 16 15:25:37 ip-172-31-20-243 systemd[1]: Started Service for snap application microk8s.daemon-flanneld.
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8329]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8329]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8329]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8329]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8329]: cat: /var/snap/microk8s/873/args/flannel-network-mgr-config: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Main process exited, code=exited, status=1/FAILURE
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Failed with result 'exit-code'.
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Service hold-off time over, scheduling restart.
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Scheduled restart job, restart counter is at 1.
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: Stopped Service for snap application microk8s.daemon-flanneld.
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: Started Service for snap application microk8s.daemon-flanneld.
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8859]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8859]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8859]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8859]: cat: /var/snap/microk8s/873/args/flanneld: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 microk8s.daemon-flanneld[8859]: cat: /var/snap/microk8s/873/args/flannel-network-mgr-config: No such file or directory
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Main process exited, code=exited, status=1/FAILURE
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Failed with result 'exit-code'.
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Service hold-off time over, scheduling restart.
Sep 16 15:25:38 ip-172-31-20-243 systemd[1]: snap.microk8s.daemon-flanneld.service: Scheduled restart job, restart counter is at 2.

@kjackal I’m sorry for the late reply; I had your question in a queue for a while…

Is this happening when you snap set ... with a new configuration value? Or is it misbehaving during snap refresh? Can you share the output of snap changes and snap change <id> (where id can be found in the snap changes output, for the operation that is relevant to the problem)?