Disabling automatic refresh for snap from store

lance · August 2, 2017, 9:52pm

I wanted to echo the concern that other power users have about needing control over the update process that goes far beyond just deferring updates for N hours. As a real-world example, I manage some devices that run Ubuntu Core and are deployed throughout a city. Thankfully, we disabled the automatic refresh timer before deploying them because just today we learned that some revision of the pc-kernel snap between 45 and 68 introduced a change that breaks our device’s functionality. This time, it seems to be an obscure issue with running hostapd on a particular ath10k device. These kinds of stability issues are not all that uncommon in the realm of wireless hardware. We had another case where an automatic update to the core snap changed how confinement works and broke our application. Let us forget about the specifics and think about the implications for systems in a production environment. I think it is expecting too much if we think snap authors are going to write tests that anticipate all possible failure cases, especially when snaps from different organizations interact with each other. Have you considered including hooks such that after any snap (e.g. pc-kernel) is updated, any other snap (e.g. a user-installed snap) can run tests and potentially block the update?

We are looking to deploy our devices in cities all around the U.S., and I am faced with the painful decision of either disabling automatic updates entirely or waiting nervously for the day that an update to core or pc-kernel breaks them all. What would you do?

ogra · August 3, 2017, 9:18am

Quoting @niemeyer from above (nobody ever talked about “hours” in this thread)

No update of any of the official packages ever goes into stable without a testing period in the beta and/or candidate channels (there is a whole QA team working on that). The simple solution is to have a device (or a few) that you monitor via software, that are on the beta channel and that notify you when something breaks … You could go as far as having your monitoring tool automatically delay the upgrades of your stable devices if your automated function-tests fail.

Additionally report a bug about the found regression so that the release into stable will be held back … (this is indeed automatable as well … )

Alternatively, you can surely have Canonical’s QA team do the above checks on your hardware directly as part of a release test as a paid-for commercial option (at least, if we don’t offer such a service yet, it is about time we do).
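A minimal sketch of the canary idea above: a monitored device on the beta channel runs function tests, and a failure produces a command that postpones refreshes on the stable fleet. The function name, the test names and the seven-day default are hypothetical. The sketch assumes a snapd release that supports the `refresh.hold` system option, which may be newer than the snapd discussed in this thread, and it only builds the command rather than running it.

```python
from datetime import datetime, timedelta, timezone


def hold_command(test_results, now, hold_days=7):
    """Return the argv that would postpone refreshes, or None if every test passed.

    test_results maps a (hypothetical) test name to True (pass) or False (fail).
    """
    failed = sorted(name for name, ok in test_results.items() if not ok)
    if not failed:
        return None  # canary is healthy, let stable devices refresh as usual
    hold_until = now + timedelta(days=hold_days)
    # Assumption: refresh.hold accepts an RFC 3339 timestamp on this snapd version.
    stamp = hold_until.isoformat(timespec="seconds")
    return ["snap", "set", "system", f"refresh.hold={stamp}"]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(hold_command({"hostapd-starts": False, "app-responds": True}, now))
```

The returned list could be handed to whatever fleet management tool pushes commands to the stable devices; that part is deliberately left out.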

lance · August 3, 2017, 3:03pm

You are absolutely right. I think I saw another thread that used the wording “N hours” and confused them in my mind. However, I do think some use cases require mechanisms other than deferring, whether you cap it at 24 hours or two months.

Thank you for the great suggestion, and we will definitely try that out going forward. As for our current situation, it would really be ideal to have a mechanism to lock a snap (pc-kernel) on a known working version until we can verify for ourselves that the regression has been resolved and unlock it. Actually, we do have such a mechanism, but it feels very much like we are trying to work around snappy. On our deployed systems, we disable automatic updates globally and use the snapd API to refresh snaps individually to known working versions. What can I do? It is my job to make sure our deployed systems stay in working condition. I cannot expect Canonical’s QA team to defer releasing essential software updates (core, pc-kernel, etc.) to the world indefinitely just because some funny guy on the Internet (me) says there was a regression on his particular hardware platform.
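The lock-to-a-known-version workflow described above can be sketched as a comparison between what is installed and a table of pinned revisions. This is only an illustration: `plan_pins` and the `my-app` snap are hypothetical, and the actual call that moves a snap to a given revision (through the snapd API or CLI) is not shown because the post does not say which request is used.

```python
def plan_pins(installed, known_good):
    """Return {snap_name: pinned_revision} for snaps that are not on their pinned revision.

    installed:  {snap_name: currently installed revision}
    known_good: {snap_name: revision verified to work on this hardware}
    """
    return {
        name: revision
        for name, revision in known_good.items()
        if installed.get(name) != revision
    }


if __name__ == "__main__":
    installed = {"pc-kernel": 68, "my-app": 12}   # my-app is a made-up example snap
    known_good = {"pc-kernel": 45, "my-app": 12}
    print(plan_pins(installed, known_good))
```

Snaps that are absent from `known_good` are left alone, so the pin table only needs to list the snaps that have actually caused trouble.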

By the way, I do think the private snap store / brand store is a viable solution for our problem. I think it is a relatively new offering because I cannot find much information about it, including pricing.

ogra · August 3, 2017, 3:15pm

You definitely can. The Canonical QA team is very much interested in avoiding any kind of regression and will happily hold back an update to stable (unless it is a serious security fix, but then they will consult the security team and the reporter of the issue about it). If we offer a kernel snap that is supposed to support your HW, we definitely never want it to be released with regressions …

OTOH there is only a limited set of hardware to test on and feedback from funny guys like you who use some hardware setup not included in the current test process is essential … I’m sure @fgimenez (as an important person in our QA and release process) agrees with that.

niemeyer · August 6, 2017, 9:47am

Doubling down on what @ogra said, we did hold back updates before for this exact reason. This was part of why 2.25 never made it into stable, for example. So if you have serious breakages, by all means please report them and we’ll hold the update back.

dmitriis · September 18, 2017, 9:58pm

I’d like to reiterate that: this functionality is absolutely necessary. When decision-making is forcefully taken away and there is no ability to roll back on your own, people will eventually either disable automatic refresh themselves or stop using the package format. If their job depends on it (they have SLAs with their clients), or if somebody’s life depends on some software staying up when it is told to stay up, there won’t be much room for discussion.

There are classes of software that do not work well with automatic upgrades.

One cannot just auto-upgrade a virtual machine process (QEMU) - sometimes security updates come out but you cannot kill the running processes because they don’t exit on timeout;
One cannot auto-upgrade an almost dead storage cluster or a database: what if my cluster is in a state where it takes one service failure to completely ruin/corrupt the whole cluster? You cannot solve certain issues before “11 PM”;
Sometimes even patch versions are not considered safe to apply right away: https://www.rabbitmq.com/clustering.html#upgrading “This will generally not be the case when upgrading from one patch version to another (i.e. from 3.0.x to 3.0.y), except when indicated otherwise in the release notes; these versions can be mixed in a cluster. Therefore, it is strongly recommended to consult release notes before upgrading.”. What if vendor tests and QA miss that requirement from a third-party dependency? Who’s responsible in this case if your whole production cluster is down or corrupted in such a way that the issue fires in a month?
RTOS kernels and software running on them. Cars have hypervisors nowadays. They also have mobile internet connections. Certain types of software may come up (like visual assistance, cruise control etc.) which will require certain safety guarantees. I would very much like a car to ask me if I’d like to upgrade before it does so by itself.

I am not against “upgrade by default” but in certain cases you have to offload the final decision to some other system. It may be a human, an AI, or an automatic or automated system which will calculate whether it is safe to proceed based upon certain criteria. I hope the examples above illustrate this well enough.

If anybody here is familiar with Control Theory, not having a hook to block upgrades is the same as removing the feedback loop from this diagram: https://en.wikipedia.org/wiki/Feedback#/media/File:Set-point_control.png

Other considerations:

how do I do Blue/Green or Red/Black deployments for patch versions if my software vendor does not provide tracks for patch versions?
what if I don’t trust my software vendor by default and run everything through a staging environment as an enterprise?
what if I am a casual user and I got a snap from some author that I don’t really trust? What if I later learn that he got hacked and attackers have silently pushed a backdoor to all systems where that snap is installed? What if I know about this right away but I have no access to my system to mitigate the issue before it’s too late?

To summarize: if this mechanism is to be generic for all kinds of software, it needs to provide more options to control snap distribution and upgrades at the snapd and snap package levels, otherwise it is not generic enough.

My personal advice is to reconsider adding such functionality at the snapd level. Brand store will be needed even if a snapd-level switch is introduced - I don’t think people generally invest time in such infrastructure unless they have unlimited resources and it’s a good idea to have that offered.

Leroy · September 26, 2017, 4:28pm

Let me add another scenario for you to be aware of and another possibility of resolution. About a month ago I added OSSEC (a HIDS) to a restricted environment I support. Shortly thereafter I got a notification that snapd couldn’t contact its server. The reason is that the standards for this particular environment require that outbound access be restricted to only business-required processes. Snapd was being blocked by the firewall because I didn’t even know it “called home” (I didn’t even know what snapd was until about two months prior to that). Environments like this (HIPAA, FINRA?, PCI, Sarbanes-Oxley?, this European requirement, etc.) require control. Any software update in the environment is a significant event. However, other requirements state that software must be kept current from a security perspective, so it’s a double bind. Pragmatically speaking, even if I opened the firewall for snapd, the overhead of reviewing all the HIDS alerts I could get would be prohibitive. “Control” is the key operative word.

Another alternative to consider for addressing this would be a rating system for upgrades: “Do it or die” (critical security flaw) down to “It would be nice when/if you get around to it”. Software providers give the update a rating, consumers get to set the allowed level for automatic updates.
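A rough sketch of how such a rating scheme might look. Nothing like this exists in snapd as far as the thread shows; the `Rating` levels and the `auto_apply` function are invented purely to illustrate the proposal.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Hypothetical urgency levels a publisher could attach to an update."""
    NICE_TO_HAVE = 1   # "when/if you get around to it"
    RECOMMENDED = 2
    CRITICAL = 3       # "do it or die" (critical security flaw)


def auto_apply(update_rating, minimum_auto_rating):
    """True if the consumer's policy lets this update install unattended."""
    return update_rating >= minimum_auto_rating


if __name__ == "__main__":
    policy = Rating.CRITICAL  # this site only auto-installs critical fixes
    for rating in Rating:
        print(rating.name, auto_apply(rating, policy))
```

Updates below the configured threshold would simply wait for a manual decision instead of being installed automatically.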

With all the above, I will say that you may have to come to a point where you say “Your needs and our solution are not sufficiently compatible, you need to forgo using it”. As long as snapd doesn’t become the next systemd and users have a choice, that may be the fallback resolution of last resort.

Ads20000 · September 26, 2017, 9:27pm

Is that ‘some point’ soon or can you still introduce features to mitigate everyone’s concerns above?

G.S.1 · September 27, 2017, 1:04am

You know, it would not be really hard for somebody to fork Snap and develop a patch for Snap which disables auto-updates. You would have to clone Snap’s code, apply the patch, and rebuild, but it is theoretically possible. Not advised though.

G.S.1 · September 27, 2017, 1:12am

I was browsing through a Fork I did, and found something interesting…

https://github.com/gjsman/snapd/blob/master/data/systemd/snapd.refresh.timer

morphis · September 27, 2017, 5:16am

AFAIK the snapd.refresh.timer unit isn’t used anymore and snapd does the scheduling now internally.

topher · October 4, 2017, 6:35am

I’ve been researching snapd as an option for improving the software packaging, delivery, and installation in a set of projects. While trying to better understand how Snaps work, I came up with a few questions and concerns, and that eventually brought me to this discussion.

Unfortunately, that’s also led me to conclude that snapd is currently not suitable for my use. I’m looking at servers with 24/7 production expectations, and revenue tied to their availability. We have specific internal processes governing how and when updates can be applied, and the process they have to go through.

The lack of ability to fully control and disable automatic updates from the client-side makes snapd a non-starter for my needs. I appreciate your desire to keep systems secure and up-to-date, but in the Enterprise world where I live, I’m responsible for the systems, so I need to have full control over the update process for everything on the box, period. To put it bluntly, they’re my servers; I’m the one that will get paged when there are issues with them, and I’m the one who is responsible both for their uptime, and for keeping them up-to-date.

What makes sense on a desktop or for casual end users just doesn’t work in my world. A download and wait for install would probably suffice for most of my requirements, but any sort of automatic installation that can’t be disabled and manually controlled is a deal-breaker. If I brought up unattended updates on our production systems during our change control meetings, I’d literally get laughed at.

I’ll try to continue following this discussion. snapd looks like a great fit for some of my use cases, and like something that could be hugely beneficial to me, if and when it adds the basic functionality that I required for my environment. Until then, I’ll keep evaluating alternatives.

niemeyer · October 4, 2017, 2:01pm

@topher Thanks for the feedback. This is the sort of conversation that can change the direction we’re going towards as we do want to make sure your needs are covered and that snaps are indeed a good option for that scenario. For the record, I personally have multiple servers deployed exclusively with snaps, which means the conversation here will surely be productive.

It’s also a good time for this topic, by the way, as we have a meeting next week to discuss the current roadmap and make sure everyone is aligned. A good chance for me to raise such issues with other stakeholders.

So here is the problem we need to solve: you want to take control over your updates which is obviously reasonable and needs to be addressed. At the other end of the spectrum, we also want to ensure that systems that don’t have a team like yours paying attention to security issues on a daily basis will be updated regardless. We also want to encourage practices on the publisher end which take into account that they are indeed affecting production systems when they release an update. This will increase the quality of the updates you get, whether you have the tight control you already have or not.

How do we solve that?

The current idea is the following: we’re extending the ability to schedule updates to a monthly basis where you can pick the exact window for the update to take place, both in terms of time and duration. You will also have the ability to delay the update for a period of time by explicitly asking for that, and the current idea is to allow that to push the updates back by up to 60 days. Finally, the snap itself will also have a chance to request that an update not take place within those 60 days because it’s a bad time for the local system (e.g. database is in use, drone is flying, etc). Plus, the snap itself will also be able to do the opposite, asking for the update to take place now because it’s the right time.

All of that is in the short term roadmap. The question is: does that solve your worries? If not, how can we tune it to make sure it does solve your concern enough that you’d put snaps in production on your servers and collaborate with us to ensure they continue meeting your needs? Is there a way to do that without introducing a global dead switch which turns every single update off altogether?
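The deferral rule proposed above (an explicitly requested delay, capped at 60 days) can be written down as a small calculation. This is a sketch of the rule as stated in the post, not of how snapd implements it; `deferred_until` and `MAX_DEFERRAL` are hypothetical names.

```python
from datetime import datetime, timedelta

# The 60-day ceiling is the figure mentioned in the post.
MAX_DEFERRAL = timedelta(days=60)


def deferred_until(available_at, requested_delay):
    """Return the latest moment an update may be postponed to.

    available_at:    when the update became available
    requested_delay: how long the administrator asked to wait
    """
    if requested_delay < timedelta(0):
        raise ValueError("requested_delay must not be negative")
    # Requests beyond the cap are silently reduced to the cap.
    return available_at + min(requested_delay, MAX_DEFERRAL)


if __name__ == "__main__":
    available = datetime(2017, 10, 4)
    print(deferred_until(available, timedelta(days=10)))
    print(deferred_until(available, timedelta(days=90)))
```

Under this rule a request for 90 days behaves exactly like a request for 60, which is the gap the enterprise posters in this thread are objecting to.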

Don · October 18, 2017, 2:03pm

@niemeyer: I installed a Rocket.Chat server a few weeks ago and the use of Snapcraft seemed great at first. Then suddenly LDAP authentication stopped working for new users that didn’t log in previously; debugging indicates some bug or something (the user is clearly found, but then it says “User not found”).

Now I have no reasonable easy way of knowing whether it was an update that broke it or something else.
Delaying updates will not fix this, there needs to be a simple way to turn it off completely.

Some random other example: Perhaps I’d like to do some staged rollout with some custom script or whatever, first deploying to one system and then slowly continuing. Autoupdate would interfere.
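The kind of custom staged rollout mentioned above might be planned like this: start with a single system, then widen each wave. The function, the doubling factor and the host names are all made up for illustration; the sketch only plans the waves and does not touch any snap.

```python
def rollout_waves(hosts, first=1, factor=2):
    """Split hosts into successive waves, each `factor` times larger than the last."""
    if first < 1 or factor < 1:
        raise ValueError("first and factor must be at least 1")
    waves = []
    size = first
    start = 0
    while start < len(hosts):
        waves.append(hosts[start:start + size])
        start += size
        size *= factor
    return waves


if __name__ == "__main__":
    fleet = [f"host{n}" for n in range(1, 9)]  # hypothetical host names
    for number, wave in enumerate(rollout_waves(fleet), start=1):
        print(number, wave)
```

An operator would refresh one wave, run checks, and only then move on to the next, which is exactly the step an unattended refresh would bypass.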

Autoupdate is a great feature for some use cases, but why should anyone care if I turn it off? Open source is all about putting the user in control; the current way of working does exactly the opposite. Even if I wanted to shoot myself in the foot, why disallow it? Why care? Samsung, Huawei, Google, Ubuntu etc. all allow users to turn off updates. Why not?

Windows 10 is working like you describe and people hate it…

If I see the length of the threads about this subject, then it’s inevitable that a disable switch will be included at some point… Why care? Why not just include it now and move on? Demand for a disable switch will only increase as the popularity of snapcraft grows.

I’ll just block it by pointing api.snapcraft.io to 127.0.0.1 in the hosts file for now as a workaround, but I’d prefer to do it through a global switch in the future.
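For anyone auditing machines where this workaround was applied, a small check of the hosts file contents could look like the following. It is a sketch only: `is_blackholed` is a hypothetical helper, the set of loopback addresses is an assumption, and whether blocking this one hostname is sufficient is not established by the post. Note that the workaround also blocks security updates.

```python
LOOPBACK = {"127.0.0.1", "::1", "0.0.0.0"}  # assumed set of "black hole" addresses


def is_blackholed(hosts_text, hostname="api.snapcraft.io"):
    """True if hosts_text maps hostname to a loopback or null address."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] in LOOPBACK and hostname in fields[1:]:
            return True
    return False


if __name__ == "__main__":
    sample = "127.0.0.1 localhost\n127.0.0.1 api.snapcraft.io  # block snap refresh\n"
    print(is_blackholed(sample))
```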

Anyway, that’s just my view on it, thanks for the great stuff you’re doing!

Kindest regards, Don

Ads20000 · October 18, 2017, 4:28pm

When did this happen? If you know when, then we can work out whether it’s Rocket.Chat or core that is to blame and report a bug (use snap changes to see which snaps have been recently refreshed), and you can just use snap revert core or snap revert rocketchat-server to revert the snap to the version that worked previously.

G.S.1 · October 18, 2017, 5:07pm

I can comment here. @Don was probably on the ‘candidate’ channel, except the most recent candidate broke LDAP (it is a known bug). However, the problem (I found this too) is that if you roll back to the ‘stable’ channel, Rocket.Chat breaks and won’t load properly. Thus you are stuck.

Don · October 19, 2017, 3:08pm

Thanks for the tips, however I didn’t mean to hijack the thread.

I found the bug report on the RC version, however I am using the stable version:
installed: 0.58.4 (1142) 165MB -
refreshed: 2017-10-06 18:40:48 +0200 CEST

I installed it sometime in September and it worked, and then it broke a few weeks later.
It turned out that it found two accounts (one actual account and one alias) for some users. Because it found two users, it said “No User Found” when it had actually found one (identical) user too many. I tested this previously in September and it worked. However, I have now adjusted the LDAP filter to exclude aliases and the issue is resolved.

@G.S.1 points out another example of why autoupdate is a bad idea in some cases. (Most likely the upgrade changed the database schema, which is incompatible with the previous version?) Normally upon upgrading one would just snapshot the whole VM and have a proper restore point. After upgrading it would be possible to immediately start testing the product to make sure everything still works. However, with auto upgrade there may not be anyone available to start working on a problem or the timing might be very bad.

denis · October 22, 2017, 1:57pm

You mean desktop version of Windows 10.

But Windows 10 IoT (which is the direct competitor of Ubuntu Core) has the ability to disable automatic updates in its paid version. The free version of Win10 IoT uses forced updates like Ubuntu Core.

And here is an interesting difference: even in a paid version, Ubuntu Core doesn’t allow you to disable automatic updates.

G.S.1 · October 23, 2017, 9:36pm

I didn’t know Ubuntu Core even had a paid version.

Ads20000 · October 24, 2017, 2:35pm

~~It doesn’t, as far as I’m aware.~~
(Edit: I realise that doesn’t really add to the conversation but Discourse doesn’t really allow immediate deletion so whelp)