Health Checks

:construction: This is work in progress. :construction:

Snaps can provide a check-health hook that developers can use to signal to the system and the user that something is not well with the snap. Note that the health is that of the snap, not of the apps it contains; it’s up to the snap developer to determine how the health of the individual apps adds up to the health of the snap as a whole.

The check-health hook is expected to use snapctl to inform snapd about the health of the snap:

snapctl set-health [--code=<error code>] <status> [<message>]

status can be one of

  • okay (which takes no message and no code),
  • waiting (some resource the snap needs isn’t ready yet and there is nothing for the user to do but wait; message(+code) must explain what it’s waiting for)
  • blocked (the user needs to do something for the snap to do something; message(+code) must say what)
  • error (something went wrong; message+code must explain what broke)

There is another status that can’t be set by the snap directly:

  • unknown: the default, no hook provided or the hook did not call snapctl set-health (code snapd-hook-no-health-set), or the hook itself failed (code snapd-hook-failed).
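
As a rough illustration of the statuses above, a check-health hook could look something like the sketch below. The config.yaml and daemon.sock paths, the two codes and the SNAPCTL override variable are made up for the example; only the snapctl set-health syntax is taken from the description above. It assumes snapd sets SNAP and SNAP_DATA when the hook runs.

```shell
#!/bin/sh
# Sketch of a check-health hook. The file names and codes are hypothetical;
# only the snapctl set-health syntax comes from the description above.

# Hypothetical override so the logic can be tried outside a snap.
SNAPCTL="${SNAPCTL:-snapctl}"

check_health() {
    if [ ! -f "$SNAP_DATA/config.yaml" ]; then
        # The user has to act: blocked, with a message saying what to do.
        "$SNAPCTL" set-health --code=config-missing blocked \
            "Create config.yaml before starting the service."
    elif [ ! -S "$SNAP_DATA/daemon.sock" ]; then
        # Nothing for the user to do but wait.
        "$SNAPCTL" set-health --code=socket-not-ready waiting \
            "Waiting for the daemon socket to appear."
    else
        # okay takes neither a message nor a code.
        "$SNAPCTL" set-health okay
    fi
}

# Run only inside the snap environment, where SNAP is assumed to be set.
if [ -n "${SNAP:-}" ]; then
    check_health
fi
```

Every branch calls snapctl set-health exactly once, so a hook written this way should never leave the snap in the unknown status with code snapd-hook-no-health-set.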

code is optional and is meant to be used by the snap or related tooling. It is not a number, but a word of 3-30 bytes matching [a-z](?:-?[a-z0-9])+. snapctl set-health will fail if code is invalid. The snapd- prefix is reserved. code is not allowed when status is okay.
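
To make the rules for code concrete, here is a small sketch of a pre-flight check a hook author might run before calling snapctl. The function name valid_health_code is hypothetical, and the pattern is rewritten as a POSIX extended regex because grep -E has no (?:...) groups. It mirrors the rules as stated above and is not snapd's own validation.

```shell
#!/bin/sh
# Returns 0 if $1 satisfies the rules described above for a health code.
# valid_health_code is a hypothetical helper, not part of snapctl.
valid_health_code() {
    code="$1"
    # Only lowercase ASCII letters, digits and hyphens (also rejects newlines).
    allowed=$(printf '%s' "$code" | LC_ALL=C tr -cd 'a-z0-9-')
    [ "$allowed" = "$code" ] || return 1
    # 3-30 bytes; after the check above every character is a single byte.
    [ "${#code}" -ge 3 ] && [ "${#code}" -le 30 ] || return 1
    # The snapd- prefix is reserved.
    case "$code" in snapd-*) return 1 ;; esac
    # ERE form of [a-z](?:-?[a-z0-9])+, anchored to the whole string.
    printf '%s\n' "$code" | LC_ALL=C grep -Eq '^[a-z](-?[a-z0-9])+$'
}
```

For example, valid_health_code oh-noes succeeds, while ab (too short), a--b (double hyphen) and snapd-hook-failed (reserved prefix) are all rejected.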

message is a freeform message (hopefully a full, properly capitalised sentence in English) that explains to the user what went wrong or what they need to do to unblock the snap. It should be at least 7 and no more than 70 bytes long; a longer message will get truncated, while one shorter than 7 bytes will make the command fail. message is required if status is not okay, and not allowed if status is okay.
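
The length limits on message can be sketched the same way. prepare_health_message is a hypothetical helper that refuses messages under 7 bytes and cuts longer ones at 70 bytes. How snapd itself truncates is not spelled out here, so the plain byte cut (which can split a multi-byte UTF-8 character) is an assumption. head -c is not in POSIX but is available in the GNU and BSD tools.

```shell
#!/bin/sh
# Prints the message cut to at most 70 bytes, or fails if it is under 7 bytes.
# prepare_health_message is a hypothetical helper, not part of snapctl.
prepare_health_message() {
    msg="$1"
    # wc -c counts bytes; tr strips the padding some wc versions add.
    len=$(printf '%s' "$msg" | wc -c | tr -d ' ')
    if [ "$len" -lt 7 ]; then
        echo "message too short: $len bytes" >&2
        return 1
    fi
    # Assumed truncation: a plain byte cut at 70 bytes.
    printf '%s' "$msg" | head -c 70
}
```

Keeping messages under the limit in the first place is the safer choice, since a truncated sentence is rarely a helpful one.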

The hook will be called periodically by snapd, so it’s important that it be fast, light, and idempotent; it will have a hard timeout of [TBD; 30 seconds?]. The user can trigger it manually with snap health [<snap>...]. It will also be called as part of any install, refresh or revert operation.
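
One way to keep a hook fast is to put its own, shorter limit on any probe that might hang. The sketch below assumes the timeout command from GNU coreutils is available to the hook. probe_status is a hypothetical helper, and mapping a timed-out probe to waiting is a choice made for this example, not something snapd prescribes.

```shell
#!/bin/sh
# Echoes the status a hook might report for a probe command run under a limit.
# Usage: probe_status <seconds> <command> [args...]
probe_status() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo okay ;;
        124) echo waiting ;;  # timeout exits with 124 when the limit is hit
        *)   echo error ;;
    esac
}
```

A hook could then branch on the echoed status and call snapctl set-health with a matching message, well before snapd's own hard timeout comes into play.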

When a snap’s health was okay in one run, and in the next stopped being okay, a warning will be emitted.

snap list will mention the last-known status of any snaps that have the health check hook, if the status is not okay (or if the snap has a health check and the status is unknown). Similarly snap info will include the full health information (but only in verbose mode if okay or if unknown with no health check hook).

~$ snap list some-broken-snap
Name              Version  Rev  Tracking  Publisher   Notes
some-broken-snap  1234567  123  beta      canonical✓  error
~$ snap info some-broken-snap
# ...
health:
    status: error
    message: Something went wrong. # not a very good message
    checked: 4 minutes ago
    code: oh-noes # Optional
# ...
~$ snap health
Snap              status   code     Message
a-nice-snap       okay     -        -
jump-drive        waiting  -        Alcubierre drive spooling up.
some-broken-snap  error    oh-noes  Something went wrong

this would be overwritten the next time we call the hook? not sure it makes sense initially without deeper thinking on it

correct

we should clarify that the hook needs therefore to be quick/idempotent/not resource intensive.

Also that those exact intervals are an internal detail and might vary down the line.

@pedronis what is the health of a snap that provides a health-check hook that doesn’t actually call snapctl set-health?

Typo:

In the case of a snap with multiple apps, is there only one health? Is the snap expected to do the multiplexing itself and use error code to signal?

yes, the health is about the snap, not the individual apps it comprises.

Will snapd force a timeout on the hook? “it’s important that” is akin to “you should” which means, no-one does because it’s not enforced :smiley:

Will snap list actively run the health-check for installed snaps, or will it use the last-known status?

Is the status associated with a snap revision? If I have a snap in error, then I snap switch to a different revision which doesn’t have a health check, what status will my snap have?

last-known status

no

snap switch does not change the current revision, so no change.


yes, all hooks are run with a timeout (there’s a long default but we can use a shorter one here)


no, health is not associated with a revision, but any refresh-like operation (switch is not one of those though) will try to run the hook; if there is no hook the status goes to unknown


I would say some variant of unknown

Are there plans to expose setting the health of a snap outside of the snap via something like snap set-health (and a supported associated snapd REST API endpoint)? This is something we have had customers ask about, but currently don’t have many use cases for, so just curious at this point.

if there is demand for it, snap set-health <snapname> <args as per snapctl set-health> would be the natural way of doing it.


Any update on making this feature available?


some parts of it already are! which parts were you needing? :slight_smile:

Currently (as of snapd 2.44) it seems any (non-root) user can run snapctl set-health inside a snap run --shell and thus change the health status of a snap.
Is that intended?


yes, that’s the state of things. Some snaps do not run as root except for their hooks. OTOH health checks are not a complete feature yet, and we might have to track or use health set by root vs non-root differently in the end. If the intention is to set health authoritatively, I would recommend doing it as root if possible.

Is there a test snap in the store that sets a health status?

I’m not sure about the store, but there is a basic test snap in the snapd source tree:

If you’ve got a checkout, you could use snap try on that directory to experiment with the feature.
