Hook timeout mechanism not working in 2.24

While checking another thing I ran into a case where 2.24 snapd “ran” a broken configure hook for 10 hours. I’m now investigating why this is happening since 2.24 already had timeout support for hooks.

You can fetch a Debian 9 qemu image from https://spread.zygoon.pl/ and you can couple that with https://github.com/zyga/snapd/tree/feature/debian-9-qemu to run the test.

Note that you should not just get master, fetch my remote and switch to this commit https://github.com/zyga/snapd/commit/2eda8023ca28060e5a027822969399bfe89ee508

Subsequent changes now merged into master mask this issue.

To reproduce:

git checkout 2eda8023ca28060e5a027822969399bfe89ee508
spread -debug -v qemu:debian-9-64:tests/upgrade/basic
1 Like

So @chipaca brilliantly found two bugs:

  1. Old snapd (like 2.21) did not set the timeout for configure hook so any new snapd (e.g. 2.24) will still hand forever. This can be solved with a patch-like approach that we took for several of the recent issues with interfaces.
  2. Snapd does not restart back into the old core snap when installation of a new core fails. In the test above, snap abort 1 means we are running snap 2.21 but snapd 2.24 until the service is restarted. We seem to be missing a maybe-restart in the undo path.
1 Like

Instead of patching, perhaps we can take 0 to mean default timeout, since we never want a script to actually run forever.

Yeah, I think that’s easier. I’ll make that happen.

currently only config has a timeout; should we give the others a timeout as well? of how much?

What is the timeout for the configure hook? Is that documented? (e.g. I don’t see it here).

5 minutes by default:

1 Like

@chipaca We can start with something like 10 minutes and see how that goes.

What happens if the timeout expires? Is it considered a hook failure (and will it cause a rollback)?

I ask because I’m writing one that needs to wait for all services to fire up, but on low-powered devices this can take quite a long time (e.g. mysql databases being generated, etc.)

Yes, hopefully it does mark the hook as failed since we’re killing it.

Having a hook that is stuck for such a long time seems like a bad practice. For anyone observing the system having a stuck script with no feedback looks like something is broken. Are there any alternatives?

I have two use-cases:

  1. Unless things have changed, the configure hook is the only place I can run a health check on my services. To do so requires that I wait until the services in question are up.
  2. I’m adding support for changing the configuration of a given service, but after the configuration changes the service needs to be restarted. If the hook runs and the service has not yet started, I have no way of knowing what state the service is in (perhaps it’s already running but has not written its pidfile yet). The only way I can be sure the configuration applies is if I wait for the service to be up, and then restart it.

I see no obvious alternatives, but I’m open to suggestions.

I will need to think about this some more, but as preliminary feedback, waiting indefinitely for a service to be up sounds like a bad idea. Think about the feeling of someone on the terminal waiting for “snap set” to confirm that the configuration was applied, or someone installing the snap waiting for 10 minutes for it to finish. That’s unacceptable, IMO.

Agreed, but like I said, I see no obvious alternatives.

The obvious alternative is to simply not do it. If it’s unreasonable to expect for the service to be up because it takes too long, then don’t wait.

If the only way the configure hook could be run was by the user, simply saying “Please wait for <service> to be running” and exiting non-zero would do it. But since that’s not the only time that hook is run, such a thing would cause a rollback.

Having a configure hook not doing either of these things results in a configure hook that potentially doesn’t apply a config, which seems similarly problematic.

Again, this seems straightforward rather than problematic. It’s unreasonable to expect anyone to issue a configuration command and then look at a screen for 10 minutes until it decides whether this is good or not. So the problem is sorted: it can’t wait without it becoming a terrible user experience. Anything else is fine: do whatever pre-checks it can do to validate the configuration, wait for a short little while, spawn a process that waits in the background until its time to apply the configuration, provide a command that checks whether the configuration is good, etc.

In the future we can also provide recurring health-checks which allow the service to report issues once it’s able to, but that won’t happen today or tomorrow.

It would be good for snap developers to know that their configure hooks’ execution time is limited, and that exceeding that time will potentially result in a rollback. Where is the best place to document this?

I’ve ported the old and vague wiki page into a slightly more comphensive forum post.

I’ve proposed https://github.com/snapcore/snapd/pull/3282 that adds a default timeout for all hooks (also for the case where configure hook task was created by older snapd which didn’t have Timeout support in hooksetup) and also increased configure hook timeout to 10 minutes. Please let me know what you think.

1 Like