Proposal: Enhanced snapd self-healing and error management functionality

The purpose of this thread is to propose a more robust way for snapd to handle installation and post-installation runs. This applies to any situation where snapd is not part of the base system configuration (e.g. certain non-Ubuntu distributions), the base system state is unknown, or the system state has changed (e.g. after updates).

Installation

There are two levels of checks:

  • Pre-install sanity checks for platform and/or system compatibility.
  • Post-install sanity checks to verify that all components work and communicate correctly.

The implementation of the checks ought to be independent of snapd (e.g. a shell script), to avoid the circular dependency of snapd checking itself. The checks can then be implemented for each step in the snapd stack.

Pseudo-code

if !“supported platform” - warn|exit
if !“apparmor + necessary kernel modules” - fix|warn|exit
if !“necessary libraries/systemd/daemons” - fix|warn|exit
if !“network/socket/etc” - fix|warn|exit
if !“other capabilities - disk/permissions” - fix|warn|exit
if !“core/core18/etc” - fix|warn|exit
if !“dummy run” - fix|warn|exit
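
The pseudo-code above could be sketched as a POSIX shell script, in line with the idea of keeping the checks independent of snapd. This is only a rough illustration: `run_check`, the `check_*` and `fix_*` helpers, the `STATE_DIR` variable and the policy assigned to each check are all hypothetical names and choices, and only a few of the listed checks are shown.

```shell
#!/bin/sh
# Sketch only: check names, policies and paths are illustrative assumptions,
# not snapd's actual installer logic.

STATE_DIR=${STATE_DIR:-/var/lib/snapd}

# run_check NAME POLICY CHECK_CMD [FIX_CMD]
#   POLICY: "warn" (report and continue), "exit" (fatal), or
#           "fix" (run FIX_CMD, re-check, fatal if still failing).
# Returns 0 when installation may continue, 1 when it must stop.
run_check() {
    name=$1; policy=$2; check=$3; fix=${4:-}
    if "$check"; then
        return 0
    fi
    if [ "$policy" = fix ] && [ -n "$fix" ] && "$fix" && "$check"; then
        echo "fixed: $name"
        return 0
    fi
    if [ "$policy" = warn ]; then
        echo "warning: $name" >&2
        return 0
    fi
    echo "fatal: $name" >&2
    return 1
}

# Individual checks, limited to what a plain shell can test.
check_platform()  { [ "$(uname -s)" = Linux ]; }
check_apparmor()  { [ -d /sys/module/apparmor ]; }
check_systemd()   { [ -d /run/systemd/system ]; }
check_state_dir() { [ -d "$STATE_DIR" ] && [ -w "$STATE_DIR" ]; }
fix_state_dir()   { mkdir -p "$STATE_DIR"; }

# Run the checks in order and stop at the first fatal one.
preinstall_checks() {
    run_check "supported platform" exit check_platform || return 1
    run_check "AppArmor available" warn check_apparmor || return 1
    run_check "systemd running" exit check_systemd || return 1
    run_check "state directory writable" fix check_state_dir fix_state_dir || return 1
}
```

An installer would call `preinstall_checks || exit 1` before doing any work. Whether a given check should warn, fix or exit is exactly the kind of policy decision this thread is meant to settle.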

First run (and potentially any arbitrary run)

The second part is a possible chicken & egg problem: the system state has changed since snapd was installed, and the new state now prevents snapd from working correctly.

There are two possible options:

  1. Try to run and fail - snapd reports the error from within its own context. If snapd cannot run or execute some of its code, the errors may be incomplete.
  2. Bootstrap snapd execution with a tiny statically compiled checker/shim, which runs the basic checks (we can define what this subset must include) and then hands over to the service itself. This means we can reliably inspect system state.

Pseudo-code:

start checker
check conditions abc (e.g. socket, disk space, permissions).
if conditions=true, jump to main snapd code.
if conditions=false, warn|exit OR
if conditions=false, jump to healer code
healer code = same set of instructions we do for the installation
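
As a rough illustration of option 2, the shim could look something like the shell sketch below (a statically compiled binary would follow the same flow). The function names, the `SNAPD_RUN_DIR`, `SNAPD_STATE_DIR` and `MIN_FREE_KB` variables and the particular conditions checked are assumptions made for the example; the healer is passed in as a command so that it can be the same code used at installation time.

```shell
#!/bin/sh
# Sketch only: paths, thresholds and the healer hook are assumptions.

SNAPD_RUN_DIR=${SNAPD_RUN_DIR:-/run}
SNAPD_STATE_DIR=${SNAPD_STATE_DIR:-/var/lib/snapd}
MIN_FREE_KB=${MIN_FREE_KB:-10240}

# Free space in KB on the filesystem holding $1.
# With -P and -k, the 4th column of the 2nd line of df output is "Available".
free_kb() { df -Pk "$1" | awk 'NR == 2 { print $4 }'; }

# "conditions abc": socket directory, permissions, disk space.
conditions_ok() {
    [ -d "$SNAPD_RUN_DIR" ] &&
        [ -w "$SNAPD_STATE_DIR" ] &&
        [ "$(free_kb "$SNAPD_STATE_DIR")" -ge "$MIN_FREE_KB" ]
}

# checker_main HEALER TARGET [ARGS...]
# Hands over to TARGET if the conditions hold; otherwise runs HEALER,
# re-checks once, and gives up with an error if the state is still bad.
checker_main() {
    healer=$1; shift
    if conditions_ok; then
        exec "$@"
    fi
    echo "checker: preconditions failed, running healer" >&2
    if "$healer" && conditions_ok; then
        exec "$@"
    fi
    echo "checker: system state could not be repaired" >&2
    return 1
}
```

The hand-over uses `exec`, so the checker does not linger as a parent process once the conditions are met. The warn|exit variant from the pseudo-code corresponds to passing a healer that always fails.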

Ideas, thoughts, etc?

Do you propose running these when snapd is installed or when snaps themselves are installed? Additionally, if you mean snapd itself, should these checks be part of the traditional Linux packaging or, e.g., the core/snapd snaps?

The checks can be either/or.

You can run the checks only at service startup, or every time a snap command runs. The second could be done if there’s a high probability of system state changes that could affect the integrity of the snap ecosystem (updates, frequent config changes, deletion of data, etc).

I would make these checks part of snapd itself (whichever package that is), because you cannot rely on having them available separately on target systems. If the two are decoupled, then we’re back to the chicken & egg problem. This means the checks can be part of the snap package - but for the very first setup and installation, they need to be made available in whatever way snapd is installed on the target system.

So if snapd is not installed, the installation then becomes:

  • snapd package is installed (not via snap).
  • subsequent snapd updates can be distributed as snaps - however, if the service fails to run, the fallback always goes to the snap-independent checker (statically compiled and/or shell script).
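
The fallback in the second point could be sketched roughly as follows. `start_with_fallback` is a hypothetical helper, both commands are passed in by the caller, and retrying the start exactly once after the checker succeeds is an assumption on my part rather than something decided here.

```shell
#!/bin/sh
# Sketch only: the helper name and the single retry are assumptions.

# start_with_fallback START_CMD CHECKER_CMD
# Runs START_CMD; if it fails, runs the snap-independent CHECKER_CMD
# and, if that succeeds, tries START_CMD one more time.
start_with_fallback() {
    start=$1; checker=$2
    if "$start"; then
        return 0
    fi
    echo "start failed, falling back to snap-independent checker" >&2
    "$checker" || return 1
    "$start"
}
```

The important property is that the checker command lives outside the snap, so it is still available when the snap-delivered snapd is the thing that is broken.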

@mvo @niemeyer Any thoughts on this fellas?