Health checks
- Hook named “check-health” (or similar) is called by snapd to check for health
- Hook needs to be idempotent so snapd can call it at any time
- Hook might be called every 5 minutes to begin with, maybe? We can always lengthen the interval later (but we can’t easily shorten it without risking breaking applications that expect to have more time for it)
- Have special status “unknown” if the snap hasn’t bothered to define it
- If the health-check hook is called and it doesn’t update the status, it’s set to “unknown”
- Hook can call
    snapctl set-status --code=[<error code>] <status> [<message>]
  (or set-health?)
- Message is a free form sentence, hopefully capitalized and readable
- Message is forbidden for active status
- Reserve “snapd-*” error code namespace for snapd
- Error code is a custom string
    [a-z]+(-?[a-z0-9]){3,}
  (dashes in the middle only)
- Reuse statuses from juju (maybe not all of them)
- active / maintenance / waiting / blocked / error
- Still need to consider whether to use “active”, due to conflict between “active daemon” and “health=active”
- Client reports the status to the server in two situations:
- When the health checks are run during a change, the pre/post status is immediately reported afterwards
- When the health checks are run “at rest”, the pre/post status is aggregated and reported on the next exchange
- Should also report a green→green situation so the server can differentiate between “bricked” and “all good”
- When do we actually revert a refresh automatically:
- Status going from active→error across a refresh
- Status going from active→blocked across a refresh, requesting a manual refresh instead? Discuss this further, probably not for first release.
- Reverts should never be triggered at rest, because we don’t know what caused the error
- Health hook name as check-health might be more indicative of desired behavior than update-status
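The revert rules above can be sketched as a small decision function. This is only an illustration in Python, not snapd code: the names Action and decide_after_check are hypothetical, and the active→blocked case is treated as "no automatic action" because the notes leave it open.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of comparing health before and after a check."""
    NONE = "none"
    REVERT = "revert"


def decide_after_check(pre: str, post: str, during_refresh: bool) -> Action:
    """Decide whether a refresh should be reverted automatically.

    pre/post are the health statuses before and after the change.
    """
    # Reverts are never triggered at rest: the cause of the error is unknown.
    if not during_refresh:
        return Action.NONE
    # The one clear-cut case in the notes: active -> error across a refresh.
    if pre == "active" and post == "error":
        return Action.REVERT
    # active -> blocked is still under discussion; assume no automatic action.
    return Action.NONE
```

For example, decide_after_check("active", "error", during_refresh=True) yields Action.REVERT, while the same transition at rest yields Action.NONE.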
Avoid issues in juju status hook:
- Frequent updates cause too many writes
- Do not require a message for the active status
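The argument rules for the proposed set-status call (status list, error code pattern, reserved "snapd-" namespace, no message for active) could be checked roughly as follows. This is a minimal Python sketch under assumptions: check_set_status_args is a hypothetical helper, the status list is the full juju-derived set from the notes (which may still shrink), and the pattern is applied to the whole string.

```python
import re
from typing import List, Optional

# Candidate statuses from the notes; "unknown" is assumed to be set by snapd only.
STATUSES = {"active", "maintenance", "waiting", "blocked", "error"}
# Error code pattern from the notes, matched against the entire code.
CODE_RE = re.compile(r"[a-z]+(-?[a-z0-9]){3,}")
# Namespace reserved for snapd itself.
RESERVED_PREFIX = "snapd-"


def check_set_status_args(
    status: str,
    code: Optional[str] = None,
    message: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    if status not in STATUSES:
        problems.append(f"unsupported status: {status!r}")
    if code is not None:
        if CODE_RE.fullmatch(code) is None:
            problems.append(f"malformed error code: {code!r}")
        elif code.startswith(RESERVED_PREFIX):
            problems.append(f"error code uses reserved namespace: {code!r}")
    if status == "active" and message:
        problems.append("message is not allowed for the active status")
    return problems
```

With this pattern the shortest accepted code has four characters, and codes such as "-abcd", "abcd-" or "a--bcd" are rejected because dashes may only appear singly in the middle.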