Developer sprint Sep 17th, 2018

Health checks

  • Hook named “check-health” (or similar) is called by snapd to check for health
  • Hook needs to be idempotent so snapd can call it at any time
  • Hook might be called every 5 minutes to begin with, maybe? We can always increase the interval later, but we can’t easily reduce it without risking breaking applications that expect more time between calls
  • Have special status “unknown” if the snap doesn’t define the hook
  • If the check-health hook is called and it doesn’t update the status, the status is set to “unknown”
  • Hook can call snapctl set-status [--code=<error code>] <status> [<message>] (or set-health?)
  • Message is a free form sentence, hopefully capitalized and readable
  • Message is forbidden for active status
  • Reserve “snapd-*” error code namespace for snapd
  • Error code is a custom string matching [a-z]+(-?[a-z0-9]){3,} (dashes in the middle only)
  • Reuse statuses from juju (maybe not all of them)
    • active / maintenance / waiting / blocked / error
  • Still need to consider whether to use “active”, due to conflict between “active daemon” and “health=active”
  • Client reports to server the status in two situations:
    • When the health checks are run during a change, the pre/post status is immediately reported afterwards
    • When the health checks are run “at rest”, the pre/post status is aggregated and reported on the next exchange
  • Should also report a green→green situation so the server can differentiate between “bricked” and “all good”
  • When do we actually revert a refresh automatically:
    • Status going from active→error across a refresh
    • Status going from active→blocked across a refresh, requesting a manual refresh instead? Discuss this further, probably not for first release.
  • Reverts should never be triggered at rest, because we don’t know what caused the error
  • Health hook name as check-health might be more indicative of desired behavior than update-status
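
A rough sketch of the rules above, in Python for illustration only (snapd itself is written in Go). The names validate_report and should_revert are hypothetical, and rejecting “unknown” as a hook-supplied status is an assumption; the notes only say snapd assigns it. The sketch covers the status list, the error code pattern, the reserved “snapd-*” namespace, the no-message rule for active, and the active→error revert condition.

```python
import re

# Statuses borrowed from juju, as listed in the notes ("maybe not all of them").
STATUSES = {"active", "maintenance", "waiting", "blocked", "error"}

# Error code pattern copied from the notes; fullmatch keeps dashes in the middle only.
ERROR_CODE_RE = re.compile(r"[a-z]+(-?[a-z0-9]){3,}")

# Namespace reserved for snapd itself.
RESERVED_PREFIX = "snapd-"


def validate_report(status, message=None, code=None):
    """Return a list of problems with a hook-supplied report (empty if valid)."""
    problems = []
    if status not in STATUSES:
        # Assumption: "unknown" is assigned by snapd only, never set by the hook.
        problems.append("unsupported status: %r" % status)
    if status == "active" and message:
        problems.append("message is not allowed for the active status")
    if code is not None:
        if not ERROR_CODE_RE.fullmatch(code):
            problems.append("malformed error code: %r" % code)
        elif code.startswith(RESERVED_PREFIX):
            problems.append("error code namespace is reserved for snapd: %r" % code)
    return problems


def should_revert(pre, post, during_refresh):
    """Revert only for active -> error across a refresh, never at rest."""
    return during_refresh and pre == "active" and post == "error"
```

For example, validate_report("error", "Disk is full", "disk-full") returns no problems, while the same call with the code "snapd-disk-full" is rejected because of the reserved namespace. should_revert returns False for any transition observed at rest, matching the rule that reverts are never triggered there.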

Avoid issues in juju status hook:

  • Frequent updates cause too many writes
  • Message should not be required for the active status
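
One possible way to limit writes, sketched in Python under the assumption that an unchanged report does not need to be persisted again. The notes do not specify a mitigation; the HealthStore class and its writes counter are hypothetical and only illustrate the idea.

```python
class HealthStore:
    """Keeps the last report per snap and skips writes when nothing changed."""

    def __init__(self):
        self.current = {}  # snap name -> (status, message, code)
        self.writes = 0    # stands in for real writes to persistent state

    def update(self, snap, status, message=None, code=None):
        """Record a report; return True only if a write was needed."""
        report = (status, message, code)
        if self.current.get(snap) == report:
            return False  # identical to the last report, nothing to write
        self.current[snap] = report
        self.writes += 1
        return True
```

With this approach a hook that reports the same status every few minutes costs one write in total, and a new write happens only when the status, message, or code actually changes.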