[rfc] Health Checks

upcoming
chipaca

#1

:construction: This is work in progress. :construction:

Snaps can provide a check-health hook that can be used by developers to signal to the system and the user that something is not well with the snap. Note that the health is that of the snap, not of the apps it contains; it’s up to the snap developer to determine how the health of the individual apps adds up to the health of the snap as a whole.

The check-health hook is expected to use snapctl to inform snapd about the health of the snap:

snapctl set-health [--code=<error code>] <status> [<message>]

status can be one of

  • okay (which takes no message [and no code?]),
  • waiting (some resource the snap needs isn’t ready yet, and there is nothing for the user to do but wait; message(+code) must explain what it’s waiting for),
  • blocked (the user needs to do something before the snap can proceed; message(+code) must say what),
  • error (something went wrong; message+code must explain what broke)
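As a rough sketch of how a hook might assemble that call, assuming the hook is written in Python and the draft syntax above holds. The names set_health_argv and report are hypothetical helpers, not part of snapd:

```python
import subprocess

# Statuses a snap may report about itself, per the draft above.
SETTABLE_STATUSES = ("okay", "waiting", "blocked", "error")


def set_health_argv(status, message=None, code=None):
    """Build the argv for `snapctl set-health` following the draft syntax."""
    if status not in SETTABLE_STATUSES:
        raise ValueError(f"status cannot be set by the snap: {status!r}")
    if status == "okay":
        # okay takes neither a message nor a code.
        if message is not None or code is not None:
            raise ValueError("okay takes neither a message nor a code")
        return ["snapctl", "set-health", "okay"]
    if message is None:
        raise ValueError(f"{status} requires a message")
    argv = ["snapctl", "set-health"]
    if code is not None:
        argv.append(f"--code={code}")
    argv += [status, message]
    return argv


def report(status, message=None, code=None):
    # Only meaningful inside a snap hook, where snapctl is available.
    subprocess.run(set_health_argv(status, message, code), check=True)
```

A hook would then end with something like report("waiting", "Waiting for the database to come up."), where the message text is only an illustration.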

There is another status that can’t be set by the snap directly:

  • unknown: the default, used when no hook is provided or the hook did not call snapctl set-health (in both cases with no code), or when the hook itself failed (message will say the hook failed and code will be set to snapd-hook-failed).

code is optional and meant to be used by the snap or related tooling. It is not a number, but a word of 3 to 30 bytes matching [a-z](?:-?[a-z0-9]){2,}. snapctl set-health will fail if code is invalid. The snapd- prefix is reserved. code is not allowed when status is okay.
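The rules for code can be sketched as a small validator. This is only an illustration of the constraints listed above, not snapd's implementation; check_code is a hypothetical name:

```python
import re

# Pattern quoted in the draft; it must match the whole code.
CODE_RE = re.compile(r"[a-z](?:-?[a-z0-9]){2,}")


def check_code(code, status):
    """Return None if code is acceptable, else a short reason string."""
    if status == "okay":
        return "code is not allowed when status is okay"
    if not 3 <= len(code.encode("utf-8")) <= 30:
        return "code must be 3 to 30 bytes long"
    if CODE_RE.fullmatch(code) is None:
        return "code does not match the required pattern"
    if code.startswith("snapd-"):
        return "the snapd- prefix is reserved"
    return None
```

Under these rules oh-noes is accepted, while ab (too short), a--b (double hyphen) and snapd-hook-failed (reserved prefix) are rejected.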

message is a freeform message (ideally a full, properly capitalised sentence in English) that explains to the user what went wrong or what they need to do to unblock the snap. It should be at least 7 and no more than 70 bytes long: a longer message will be truncated, while a shorter one will cause the call to fail. message is required if status is not okay, and not allowed if status is okay.
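A minimal sketch of the message rules follows. The draft does not say how truncation treats multi-byte characters, so dropping a partially cut trailing character is an assumption here, and normalise_message is a hypothetical name:

```python
MIN_MESSAGE_BYTES = 7
MAX_MESSAGE_BYTES = 70


def normalise_message(status, message=None):
    """Apply the draft's message rules; return the message to store, or None."""
    if status == "okay":
        if message:
            raise ValueError("message is not allowed when status is okay")
        return None
    if message is None:
        raise ValueError("message is required when status is not okay")
    data = message.encode("utf-8")
    if len(data) < MIN_MESSAGE_BYTES:
        raise ValueError("message must be at least 7 bytes long")
    if len(data) > MAX_MESSAGE_BYTES:
        # Assumption: cut at 70 bytes and drop any character split by the cut.
        return data[:MAX_MESSAGE_BYTES].decode("utf-8", errors="ignore")
    return message
```

Note that the limits are in bytes, so a message with accented or non-Latin characters reaches the limit in fewer than 70 characters.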

The hook will be called periodically by snapd, so it’s important that it be fast, light, and idempotent; it will have a hard timeout of [TBD; 30 seconds?]. The user can trigger it manually with snap health [<snap>...]. It will also be called as part of any install, refresh or revert operation.

When a snap’s health was okay in one run and is no longer okay in the next, a warning will be emitted.
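Read literally, the warning rule is a comparison between two consecutive results. The sketch below assumes that only a change away from okay warns, so a change from unknown to error, for example, does not; should_warn is a hypothetical name:

```python
def should_warn(previous_status, current_status):
    """True when health was okay on the previous run and is not okay now."""
    return previous_status == "okay" and current_status != "okay"
```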

snap list will mention the last-known status of any snaps that have the health check hook, if the status is not okay (or if the snap has a health check and the status is unknown). Similarly snap info will include the full health information (but only in verbose mode if okay or if unknown with no health check hook).

~$ snap list some-broken-snap
Name              Version  Rev  Tracking  Publisher   Notes
some-broken-snap  1234567  123  beta      canonical✓  error
~$ snap info some-broken-snap
# ...
health:
    status: error
    message: Something went wrong. # not a very good message
    checked: 4 minutes ago
    code: oh-noes # Optional
# ...
~$ snap health
Snap              Status   Code     Message
a-nice-snap       okay     -        -
jump-drive        waiting  -        Alcubierre drive spooling up.
some-broken-snap  error    oh-noes  Something went wrong.
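The display rules for snap list and snap info described above can be sketched as two small predicates. This is one reading of the draft, not snapd's code; list_note and info_shows_health are hypothetical names, and has_hook stands for "the snap ships a check-health hook":

```python
def list_note(status, has_hook):
    """Return the status to show in the Notes column of snap list, or None."""
    # Only snaps with a hook are mentioned, and only when not okay
    # (this includes unknown when the hook exists).
    if has_hook and status != "okay":
        return status
    return None


def info_shows_health(status, has_hook, verbose=False):
    """Decide whether snap info should print the health section."""
    # okay, and unknown without a hook, are shown only in verbose mode.
    quiet = status == "okay" or (status == "unknown" and not has_hook)
    return verbose or not quiet
```

With these rules, some-broken-snap from the example above gets error in its Notes column, while a-nice-snap shows nothing there unless snap info is run in verbose mode.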

#2

this would be overwritten the next time we call the hook? not sure it makes sense initially without deeper thinking on it

correct

we should therefore clarify that the hook needs to be quick, idempotent, and not resource-intensive.

Also that those exact intervals are an internal detail and might vary down the line.


#4

@pedronis what is the health of a snap that provides a check-health hook that doesn’t actually call snapctl set-health?


#5

Typo:

In the case of a snap with multiple apps, is there only one health? Is the snap expected to do the multiplexing itself and use error code to signal?


#6

yes, the health is about the snap, not the individual apps it comprises.


#7

Will snapd force a timeout on the hook? “It’s important that” is akin to “you should”, which means no one does it, because it’s not enforced :smiley:


#8

Will snap list actively run the health-check for installed snaps, or will it use the last-known status?

Is the status associated with a snap revision? If I have a snap in error, then I snap switch to a different revision which doesn’t have a health check, what status will my snap have?


#9

last-known status

no

snap switch does not change the current revision, so no change.


#10

yes, all hooks are run with a timeout (there’s a long default but we can use a shorter one here)


#11

no, health is not associated with a revision, but any refresh-like operation (switch is not one of those though) will try to run the hook; if there is no hook, the status goes to unknown


#12

I would say some variant of unknown


#13

Are there plans to expose setting the health of a snap outside of the snap via something like snap set-health (and a supported associated snapd REST API endpoint)? This is something we have had customers ask about, but currently don’t have many use cases for, so just curious at this point.


#14

if there is demand for it, snap set-health <snapname> <args as per snapctl set-health> would be the natural way of doing it.