Debugging a snap with failing daemons is difficult

It can be difficult to debug a snap when one of its daemons fails to start during an install or refresh. The install/refresh is always aborted, so if the root cause is, for example, corrupted data in $SNAP_DATA, the system journal logs from the installation may not be enough to debug the issue.

It would be nice if there were a development flag that tells snapd to ignore the daemons that failed to start and continue with the installation anyway. Currently, if I don’t have access to the snapcraft recipe that built the snap, my workflow for this kind of debugging is:

  1. snap download the snap
  2. unsquashfs the snap
  3. modify the install hook to snapctl stop --disable $SNAP_NAME.failing-service
  4. Re-pack the snap with snap pack
  5. Install the new snap with --dangerous

Note that this is still not perfect: because we modified the snap contents, we lose any auto-connections, etc. that the store assertion may have granted to the snap. We would then probably need to continue with:

  1. re-connect all snap interfaces with snap connect

Then finally we are ready to run the failing service and do actual debugging with:

  1. snap start $SNAP_NAME.failing-service
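The manual steps above could be scripted roughly as follows. This is only a sketch under some assumptions: the function names and the "-root" and "-debug.snap" file names are made up for illustration, the snap's install hook (if it ships one) is assumed to be a shell script that does not end in exec or exit, and the snap commands may need root.

```shell
#!/bin/sh
# Sketch only: repack a store snap so one service is disabled before it can fail.
# Function and file names here are illustrative, not part of snapd.

# patch_install_hook DIR SERVICE
# Append a "stop and disable" command to DIR/meta/hooks/install.
patch_install_hook() {
    hook="$1/meta/hooks/install"
    mkdir -p "$1/meta/hooks"
    # Create a minimal hook if the snap does not ship one.
    [ -f "$hook" ] || printf '#!/bin/sh\n' > "$hook"
    # $SNAP_NAME is left unexpanded on purpose: snapd sets it when the hook runs.
    # The leading newline guards against a hook file with no trailing newline.
    printf '\nsnapctl stop --disable "$SNAP_NAME.%s"\n' "$2" >> "$hook"
    chmod +x "$hook"
}

# repack_without_service NAME SERVICE
# Steps 1-5 from the list above; stops at the first failing command.
repack_without_service() {
    snap download "$1" &&
    unsquashfs -d "$1-root" "$1"_*.snap &&
    patch_install_hook "$1-root" "$2" &&
    snap pack --filename="$1-debug.snap" "$1-root" &&
    snap install --dangerous "./$1-debug.snap"
}
```

For example, repack_without_service $SNAP_NAME failing-service, followed by snap connect for each interface (snap connections $SNAP_NAME shows what is currently connected) and then snap start $SNAP_NAME.failing-service.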

Obviously if you are the snap author you have the snapcraft recipe and you could re-build it with the install hook that disables all services and push it up to the store in a branch (thus getting all the auto-connections you need), but still that introduces a lot of overhead to debug an issue like this.

Ideally I think the flow for debugging this would look like

  1. Install the snap such that failed daemons are ignored with snap install --ignore-failed-daemons $SNAP_NAME
  2. do your debugging with $SNAP_DATA and snap start

Also note that this problem doesn’t really happen if you declare daemon: simple, because then systemd doesn’t really check whether the daemon started successfully. If you are using any of the other daemon settings, though, you are liable to hit this problem.

Thoughts?

For an easy example of a failing snap, see my nope snap:

$ snap install nope
error: cannot perform the following tasks:
- Start snap "nope" (2) services ([start snap.nope.nope.service] failed with exit status 1: Job for snap.nope.nope.service failed because the control process exited with error code.
See "systemctl status snap.nope.nope.service" and "journalctl -xe" for details.
)

As an orthogonal side-note I will comment that one other way to solve this in a more generically useful way would be to implement some support for providing config items to the snap at install time such that you could do something like:

snap install $SNAP_NAME -- key=val

and it’s as if you did snap set $SNAP_NAME key=val before installing the snap. Then you could implement this kind of debugging with snap install $SNAP_NAME -- disablesvcs=true and the install hook can get the value of that with snapctl get disablesvcs and if that’s true then just disable all services and move on.

That does have the disadvantage of needing cooperation from the snap’s install hook so even with this you would still have problems with snaps you don’t control if they don’t implement support for that in the install hook.
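For illustration, the install hook side of that idea might look something like the sketch below. The disablesvcs key is the hypothetical option from the example above, and SERVICES is an assumed list that the snap author would have to maintain; only snapctl get and snapctl stop --disable are existing commands.

```shell
#!/bin/sh
# meta/hooks/install (sketch). `disablesvcs` is a hypothetical config key and
# SERVICES a hypothetical, author-maintained list of service names.

SERVICES="${SERVICES:-failing-service}"

disable_services_if_requested() {
    # snapctl get prints an empty string when the key is unset.
    if [ "$(snapctl get disablesvcs)" = "true" ]; then
        for svc in $SERVICES; do
            # Stop the service and keep it from being started by snapd.
            snapctl stop --disable "$SNAP_NAME.$svc"
        done
    fi
}

disable_services_if_requested
```

With something like this in place, the services stay disabled until someone runs snap start on them by hand.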

The approach seems a bit too specific: a snap can also fail in one of its hooks. Also, ignoring errors when starting services might make a later hook fail as well, which would again undo the install. I fear that, in general, ignoring errors will tend to compound, confusing snapd and maybe the user.

A different but slightly more complex approach would be to have options to tell snapd to pause a single install/refresh change after/before a given task either on the Do or the Undo sequence. On the paused change then:

  • snap abort would unpause it and abort it as expected
  • conflict checks would be relaxed (details to be thought through)
  • there would be snap debug commands (to be designed) to manipulate/step it

What if you just had an install option to have snapd ignore failed services and failed hooks (and possibly anything else that fails from inside the snap)? That is, don’t ignore failures like an invalid snap.yaml spec, only failures from running code inside the snap, such as starting services, running hooks, and possibly other things I’m not aware of.

This would be nice to have and as long as the paused change allows me to inspect the snap’s state and attempt to debug it by still running daemons with snap run that would be sufficient.