One of our users reported that in a cluster of 5 nodes, two of them refreshed correctly while the remaining three failed in the “Copy snap data” phase. I am not sure what may cause that. What happens during the “Copy snap data” phase? Our user also reported that snapd went into a bad state and recovered only after killing it.
The issue is reported at [1]; one of the failing nodes has been kept in case we want to take a look.
Thanks for this report. I looked at it, and it’s puzzling that it hangs in copy data; from the report it looks like it did not even get to the point where it creates the target directory.
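If it helps to dig further on the node that was kept: one way to spot this kind of hang is to look for tasks stuck in uninterruptible sleep (“D” state) and check what they are waiting on. A rough sketch of my own (not something from the report), which just walks /proc:

```go
// Rough diagnostic sketch: list tasks stuck in uninterruptible sleep ("D"),
// which is what a flush blocked on an unreachable storage backend looks like.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dirs, _ := filepath.Glob("/proc/[0-9]*")
	for _, dir := range dirs {
		stat, err := os.ReadFile(filepath.Join(dir, "stat"))
		if err != nil {
			continue // process exited between listing and reading
		}
		// /proc/<pid>/stat layout: pid (comm) state ...; comm may contain
		// spaces, so split on the last ')'.
		s := string(stat)
		i := strings.LastIndex(s, ")")
		if i < 0 {
			continue
		}
		fields := strings.Fields(s[i+1:])
		if len(fields) == 0 || fields[0] != "D" {
			continue
		}
		// wchan shows the kernel symbol the task is sleeping in.
		wchan, _ := os.ReadFile(filepath.Join(dir, "wchan"))
		fmt.Printf("pid=%s comm=%s wchan=%s\n",
			filepath.Base(dir),
			s[strings.Index(s, "(")+1:i],
			strings.TrimSpace(string(wchan)))
	}
}
```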
I’ve updated the report on GitHub with the likely cause: snapd launched a “sync” which hangs forever because a rook-ceph volume was still mounted, so the rbd kernel module keeps trying to connect to the Ceph cluster running inside Kubernetes, but microk8s is stopped and the cluster is unreachable. Another example of many issues in an overly-complex system lining up to cause a problem…
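For context, here is a minimal Go sketch of the failure mode (this is not snapd’s actual code; the /var/snap path and the 30s timeout are made-up examples): a global sync(2) waits for every mounted filesystem, including the stuck rbd one, whereas syncfs(2) can be scoped to the filesystem that actually holds the data, and a timeout at least keeps the caller from blocking forever.

```go
// Sketch only, not snapd's implementation: global sync(2) vs a scoped,
// timed-out syncfs(2). Path and timeout below are assumptions for illustration.
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// syncfsWithTimeout flushes only the filesystem containing path (syncfs(2)),
// so an unrelated, unreachable rbd mount cannot stall it. The syscall itself
// cannot be cancelled; the timeout just stops the caller from waiting forever
// (the goroutine is leaked if the kernel never returns).
func syncfsWithTimeout(path string, timeout time.Duration) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	done := make(chan error, 1)
	go func() { done <- unix.Syncfs(int(f.Fd())) }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("syncfs(%q) still blocked after %v", path, timeout)
	}
}

func main() {
	// unix.Sync() is the global sync(2): it waits on *every* mounted
	// filesystem, including the stuck rbd one, which is the hang above.
	if err := syncfsWithTimeout("/var/snap", 30*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

This is only meant to illustrate why the hang is outside snapd’s control once a global sync is issued; whether scoping or timing out the flush is the right fix for snapd is a separate question.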