Snap best practices in production, and LXD auto-refresh stuck in "Doing" status for 2 days

This topic is about how to handle snap in a production environment. I ran into the following issue recently, and it led me to think about what a good snap configuration for production would look like.

The issue: I have a 3-node cluster where LXC is used to create and deploy containers. One month ago, we had an issue on both n1 and n3 that was the same as this one: snap auto-refreshes LXD, but the change gets stuck at the following step:

Status  Spawn                     Ready                     Summary
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Ensure prerequisites for "lxd" are available
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Download snap "lxd" (11964) from channel "stable"
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Fetch and check assertions for snap "lxd" (11964)
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Mount snap "lxd" (11964)
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Run pre-refresh hook of "lxd" snap if present
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Stop snap "lxd" services
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Remove aliases for snap "lxd"
Done    2 days ago, at 14:02 UTC  2 days ago, at 14:02 UTC  Make current revision for snap "lxd" unavailable
Doing   2 days ago, at 14:02 UTC  -                         Copy snap "lxd" data
Do      2 days ago, at 14:02 UTC  -                         Setup snap "lxd" (11964) security profiles
Do      2 days ago, at 14:02 UTC  -                         Make snap "lxd" (11964) available to the system
Do      2 days ago, at 14:02 UTC  -                         Automatically connect eligible plugs and slots of snap "lxd"
Do      2 days ago, at 14:02 UTC  -                         Set automatic aliases for snap "lxd"
Do      2 days ago, at 14:02 UTC  -                         Setup snap "lxd" aliases
Do      2 days ago, at 14:02 UTC  -                         Run post-refresh hook of "lxd" snap if present
Do      2 days ago, at 14:02 UTC  -                         Start snap "lxd" (11964) services
Do      2 days ago, at 14:02 UTC  -                         Remove data for snap "lxd" (11727)
Do      2 days ago, at 14:02 UTC  -                         Remove snap "lxd" (11727) from the system
Do      2 days ago, at 14:02 UTC  -                         Clean up "lxd" (11964) install
Do      2 days ago, at 14:02 UTC  -                         Run configure hook of "lxd" snap if present
Do      2 days ago, at 14:02 UTC  -                         Run health check of "lxd" snap
Doing   2 days ago, at 14:02 UTC  -                         Consider re-refresh of "lxd"

When we dealt with the issue on n1 and n3, we decided to reboot the hosts. That was enough to fix it, but it is not viable at all in a production environment. When I talked with Stephane Graber on GitHub, he advised me to run systemctl restart snapd instead of rebooting the hosts. He also suggested reporting this issue here, so here I am :slight_smile:
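For anyone hitting the same thing, here is a minimal sketch of how I would spot a stuck change before restarting snapd. The helper name doing_changes is my own, and it assumes the usual `snap changes` column layout (ID, Status, Spawn, Ready, Summary), which may differ between snapd versions.

```shell
# Print the IDs of snapd changes whose status is "Doing".
# Reads `snap changes`-style text on stdin. The column layout
# (ID first, Status second) is an assumption about snapd's output.
doing_changes() {
    awk 'NR > 1 && $2 == "Doing" { print $1 }'
}

# Typical use on an affected node (commands shown as comments, not run here):
#   snap changes | doing_changes     # list the IDs of in-progress changes
#   snap tasks <id>                  # show the per-task table for one change
#   sudo systemctl restart snapd     # the restart suggested instead of a reboot
```

This only reports changes that are in progress. Whether a change is really stuck still has to be judged from the Spawn time in the task table.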

Has anyone else experienced this issue? More generally, how do you set up your auto-refresh timer in a production environment? I read that it is not possible to get rid of auto-refresh, so how should an efficient timer be defined? If you have any other tips, I would be happy to hear them!

EDIT: I just ran systemctl restart snapd and it succeeded. But I'm still very interested in tips from those of you running LXC with snap in a production environment (i.e. how to set an auto-refresh timer, when to schedule it, or anything else related).

Thank you in advance if you can help!
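To make the timer question concrete, here is a rough sketch of what I have in mind, based on my reading of the snapd documentation for the refresh.timer system option. The helper name refresh_timer_cmd and the Saturday 03:00-05:00 window are only examples, and the function prints the command instead of running it so it can be reviewed first.

```shell
# Print (without running) the command that restricts snap auto-refresh
# to a weekly window. Arguments: day (e.g. sat), start time, end time (HH:MM).
# The function name and the dry-run approach are illustrative choices.
refresh_timer_cmd() {
    printf 'snap set system refresh.timer=%s,%s-%s\n' "$1" "$2" "$3"
}

# Example (commands shown as comments, not run here):
#   refresh_timer_cmd sat 03:00 05:00   # prints the snap set command to review
#   snap refresh --time                 # shows the active timer and next refresh
```

As far as I understand, the printed command has to be run with sudo, and newer snapd releases also offer snap refresh --hold to postpone refreshes, but I have not tried that on these hosts.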