I had an idea for addressing the auto refresh issues to satisfy both the intent of snap for developers and the palpable risks that forced updates present to the install base. This post is about using the capabilities snap and containers like LXD provide us to pragmatically solve this real issue and hopefully satisfy all parties.
First a little background on our recent experience.
This past weekend LXD pushed 3.6 to stable. To put it simply, it was not stable. We recently made the change to utilize snap, as it is so recommended as the preferred package management for a large implementation of hosts with clean standard Ubuntu 18.04. A cluster of 30+ hosts providing several hundred containers, most of which are mission critical business systems for a customer user base of 20K, was crippled by this failed stable release. To the credit off LXD and Snap, the running containers stayed up despite the bug which caused LXD to fail restart after the refresh; however all functionality to snapshot, deploy new containers, and restart machines was down. In cloud environments where those are near or even absolute core functions needed for daily operation, that is a significant issue.
Again, to the credit of LXD developers, the fix was deployed to stable within about 20 hours of the initial bug report. To ignore the risks of snap auto-refresh in use cases such as LXD where the underlying service infrastructure can be crippled, is to ignore decades of best practices for development, operations and quality management. As a developer of enterprise critical systems, I fully understand and appreciate the Snapcraft “disruptor” approach to solve some longstanding development life-cycle challenges; but if being a “disruptor” actual disrupts the functionality of major infrastructure in the market enough times, the viability of snap will face significant challenges. In this first day, we already are facing to respond to many customers who want us to exit from our recent commitment to utilize snap. These include major brands, hospitals, and Fortune 500’s It is not hard to imagine how quickly this could spiral.
But my post is actually more about how to address this, as a believer in the benefits of snap despite being a “victim” of the inherent risks. The current option to delay updates does not address the issue, at least in this case from our experience and others posting regarding Snap and LXD, particularly due to the impacts on clustering. LXD itself though could be a viable tool for managing the challenge.
(If not a global solution, this or something like it absolutely needs to be implemented for hypervisor-type packages which have far reaching impacts on entire cloud platforms when their is a refresh failure.)
Implement automatic “previous version” Tracks. Allow user to select to follow this “Penultimate Track”. Each version Stable release triggers creation of the Penultimate Track (previous version) which can remain for say 30 days or no more than 2 previous version Tracks. Perhaps this is the intent or best practice for Candidates and Tracks, but since the Candidate often seems to become “Stable” with little notice (I believe 1 day in this instance) it isn’t suitable; And Tracks are up to the developer and sometimes few and far between.
Global option to turn off Snap without disabling the packages deployed. This sets the bar higher than it would by allowing disabling of refresh on individual snaps. A user would have to have a significant enough issue with a particular snap to take the step of disabling all together. By allowing this without the need to redeploy or in many cases build the app from source, longterm adoption of snap is not significantly impacted. If we end up having to move off of snap due to this latest issue, I can guarantee neither we nor our customers will support moving back to snap; but if we have a failsafe to protect the environment while issues of snap or the maintainer of a package are addressed, we wouldn’t have an issue re-enabling it.
LXD based validation testing- We are already working on defining a script and process for using the latest LXD ability to convert a host into a Container. Our initial approach is to create a tool to create a LXD container of the current host state, spin it up, snap refresh, and validate it does not fail. We manually did it today and at least in this case it would have shown that their was an issue prior to killing our production environment. Long term, something like this would be incredible as part of snap, whereby a container was created with all the same settings/configs as the machine being refreshed and tested before forcing the refresh on the actual system…This cool tool, would only be useful though, if snap provides the required mechanisms to hold off on the forced refresh if validation fails.
Rating system- Kind of a separate idea, just putting out there for discussion. A rating system based on bug reports or user feedback of failures from Snap refreshes would not only be informative to the users about the reliability of applications, but could also be used as a mechanism for enforcing rules requiring maintenance of prior release tracks as well as the timeline a developer is permitted to force auto refresh. For example if the maintainer has refresh failures indicated by ratings in the last 12 months, they must maintain the Prior Version Track for 90 days.
Perhaps some of this has been discussed or addressed before. As I mentioned we are recent snappers. I appreciate the discussion and any comments and hope this a useful contribution to the discourse. Ultimately we like many others will have to make a decision, sooner than we’d like, to mitigate the risks of auto refresh or be forced by the customers/users to abandon snap.