Disabling automatic refresh for snap from store

Original response folded for not contributing much to the topic.

I was specifically responding to the actual point you made regarding trust in the snap being published.

I’m glad to hear the automatic update won’t in fact be a problem for you in practice. This is the key thing we want to ensure.

It’s going to be a problem for me in practice because in practice, I’m not going to use it.

Did you miss the part where I said, “the entire auto-update feature is adding work for me?” You’re talking circles around me at this point, and I can see I’m not going to get any further here. I hope you become less stubborn in the future, but for now, thank you for taking the time to talk to me as long as you have.

3 Likes

I’ve posted the proposed syntax for within-the-month refresh windows as an answer to this existing topic:

Not trying to be confrontational, but technically, yes, that does seem to be what snappy is doing at the moment…

Pretty sure the above quote proves it :stuck_out_tongue: snappy won’t allow disabling updates because it worries that people just won’t update - it doesn’t trust its users. Phrasing it as Jacob has is confrontational, but correct.

I can see the benefit of taking this decision in the medium term, though: it provides an incentive to produce features that will encourage developers not to use the kill switch if it is eventually introduced (similar to Niemeyer’s decision to block improving snapd-xdg-open packages so as to produce an incentive to get that integrated into snapd itself). Sorry for the notification spam BTW, but you write interesting stuff!

This view is way too narrow … what snap tries to do is take away the need to care about updates at all, by improving the packaging environment so that software can take care of itself: it can run self-tests and automatically roll back to the last working version.

The only way to keep software secure (to protect it from Mirai-style attacks or from infection by encryption trojans) is to minimise its vulnerabilities by updating it in the timeliest possible manner. The one factor that breaks this principle is human intervention.

Let’s take a look at a typical sysadmin today. When an upgrade comes in, he will hold it back and run a bunch of tests … if these tests succeed, he will pick the best time to apply the update to all users and roll it out … or perhaps he does staged updates and gives it to a small subset of his users first …

Now imagine the package manager actually lets upstream include these tests in the package itself, so admins can submit their use-case tests to be shipped and run automatically (including auto-rollback) by the package manager … it also has a scheduling feature and rollout control …
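To make that a bit more concrete, here is a minimal sketch of what such an upstream-shipped self-test could look like, written as a snap post-refresh hook. It assumes the hook behaviour described in the snapd hook documentation (a failing post-refresh hook is expected to make snapd undo the refresh), a base that provides Python 3, and a made-up service port; none of this is an official example.

```python
#!/usr/bin/env python3
# Hypothetical post-refresh hook (snap/hooks/post-refresh) for a snap whose
# base provides Python 3. Assumption: if this script exits non-zero, snapd
# aborts the refresh and rolls back to the previous revision; verify that
# against the snapd hook documentation for your release.
import os
import socket
import sys


def service_answers(host="127.0.0.1", port=8080, timeout=5.0):
    """Tiny smoke test: can we open a TCP connection to our own service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main():
    revision = os.environ.get("SNAP_REVISION", "unknown")
    if not service_answers():
        print("self-test failed on revision %s; requesting rollback" % revision,
              file=sys.stderr)
        return 1  # non-zero exit asks snapd to undo the refresh
    print("self-test passed on revision %s" % revision)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```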

What snappy tries to do is not take away trust from anyone, but improve and encourage automation by providing an easy environment for it. If there is any trust shifted around then it is actually shifted towards developers and their ability to ship good tests.

The final target is completely self-maintained machines through automation. The developers working on this (including the ones from the community I met) are way too excited about the technical aspects and possibilities of this to actually think about trust or distrust or any other political topics :wink:

4 Likes

In the case of an IoT device, would an application snap be able to use that same mechanism to schedule updates of the core or kernel snap if those updates would interrupt operation? I haven’t read as much about the update of those snaps or how their updates interact with the rest of the system; it would be good to have more information on that.

Indeed they would. Both the scheduling and the deferring would work for all snaps, including kernel, gadget, core, and any other application pre-seeded into the device as well.
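To illustrate (not an official recipe): a small sketch of how a device admin could pin the refresh window from a provisioning script, assuming a snapd version that exposes the refresh.timer system option and the `snap refresh --time` query; the window value is just an example.

```python
#!/usr/bin/env python3
# Sketch: constrain automatic refreshes to a maintenance window. Assumes the
# snapd on the device supports the refresh.timer system option; the setting
# applies to every snap on the device (core, kernel, gadget, apps).
import subprocess


def set_refresh_window(window="fri,23:00-01:00"):
    """Ask snapd to run automatic refreshes only inside the given window."""
    subprocess.run(["snap", "set", "system", "refresh.timer=%s" % window],
                   check=True)


def show_refresh_schedule():
    """Return snapd's view of the timer, the last refresh and the next one."""
    result = subprocess.run(["snap", "refresh", "--time"],
                            check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    set_refresh_window()
    print(show_refresh_schedule())
```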

1 Like

It’s certainly very exciting stuff! :smiley:

I wanted to echo the concern that other power users have about needing control over the update process that goes far beyond just deferring updates for N hours.

As a real-world example, I manage some devices that run Ubuntu Core and are deployed throughout a city. Thankfully, we disabled the automatic refresh timer before deploying them, because just today we learned that some revision of the pc-kernel snap between 45 and 68 introduced a change that breaks our device’s functionality. This time it seems to be an obscure issue with running hostapd on a particular ath10k device; these kinds of stability issues are not all that uncommon in the realm of wireless hardware. We had another case where an automatic update to the core snap changed how confinement works and broke our application.

Let us forget about the specifics and think about the implications for systems in a production environment. I think it is expecting too much to assume snap authors will write tests that anticipate every possible failure case, especially when snaps from different organizations interact with each other. Have you considered including hooks such that after any snap (e.g. pc-kernel) is updated, any other snap (e.g. a user-installed snap) can run tests and potentially block the update?

We are looking to deploy our devices in cities all around the U.S., and I am faced with the painful decision of either disabling automatic updates entirely or waiting nervously for the day that an update to core or pc-kernel breaks them all. What would you do?

5 Likes

Quoting @niemeyer from above (nobody ever talked about “hours” in this thread)

No update of any of the official packages ever goes into stable without a testing period in the beta and/or candidate channels (there is a whole QA team working on that). A simple solution is to have a device (or a few) on the beta channel that you monitor via software and that notifies you when something breaks … You could go as far as having your monitoring tool automatically delay the upgrades of your stable devices if your automated function tests fail.
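As a rough sketch of that idea (not a supported tool): a canary device on the beta channel runs your functional tests, and on failure you push back the automatic refreshes of the stable fleet. This assumes a snapd version that supports the refresh.hold system option; run_function_tests() is a placeholder for your own checks, and fanning the hold out to the stable devices is left to whatever fleet tooling you already use.

```python
#!/usr/bin/env python3
# Canary sketch: if the beta-channel device fails its function tests, hold
# automatic refreshes for a while. Assumes snapd supports the refresh.hold
# system option; run_function_tests() is a stand-in for your own checks.
import datetime
import subprocess


def run_function_tests():
    """Placeholder for device-specific checks (hostapd up, radio usable, ...)."""
    return True


def hold_refreshes(days=7):
    """Tell snapd on this machine not to auto-refresh until the given time."""
    until = (datetime.datetime.now(datetime.timezone.utc)
             + datetime.timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    subprocess.run(["snap", "set", "system", "refresh.hold=%s" % until],
                   check=True)


if __name__ == "__main__":
    if not run_function_tests():
        # The canary found a regression on beta: delay the stable devices too,
        # e.g. by running this same hold via your fleet-management tooling.
        hold_refreshes(days=7)
```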

Additionally, report a bug about the regression you found so that the release into stable will be held back … (this is indeed automatable as well …)

Alternatively, you can surely have Canonical’s QA team do the above checks directly on your hardware as part of a release test, as a paid-for commercial option (at least: if we don’t offer such a service yet, it is about time we do :wink: )

1 Like

You are absolutely right. I think I saw another thread that used the wording “N hours” and confused them in my mind. However, I do think some use cases require mechanisms other than deferring, whether you cap it at 24 hours or two months.

Thank you for the great suggestion, and we will definitely try that out going forward. As for our current situation, it would really be ideal to have a mechanism to lock a snap (pc-kernel) on a known working version until we can verify for ourselves that the regression has been resolved, and then unlock it. Actually, we do have such a mechanism, but it feels very much like we are trying to work around snappy: on our deployed systems, we disable automatic updates globally and use the snapd API to refresh snaps individually to known working versions. What can I do? It is my job to make sure our deployed systems stay in working condition. I cannot expect Canonical’s QA team to defer releasing essential software updates (core, pc-kernel, etc.) to the world indefinitely just because some funny guy on the Internet (me) says there was a regression on his particular hardware platform.
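For anyone curious, the per-snap refresh side of that workaround looks roughly like this: a sketch against snapd’s REST API over /run/snapd.socket. The refresh action on /v2/snaps/&lt;name&gt; is documented, but whether the optional revision field is honoured depends on your snapd version and store access, so treat that part as an assumption and test it before relying on it (the call also needs root).

```python
#!/usr/bin/env python3
# Sketch: ask snapd (via its Unix-socket REST API) to refresh one snap.
# The "revision" pin is an assumption -- support for it varies with snapd
# version and store permissions, so verify before using it in production.
import http.client
import json
import socket

SNAPD_SOCKET = "/run/snapd.socket"


class SnapdConnection(http.client.HTTPConnection):
    """Plain HTTP, but over snapd's Unix domain socket instead of TCP."""

    def __init__(self):
        super().__init__("localhost")

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(SNAPD_SOCKET)


def refresh_snap(name, channel="stable", revision=None):
    body = {"action": "refresh", "channel": channel}
    if revision is not None:
        body["revision"] = revision  # hypothetical pin; see note above
    conn = SnapdConnection()
    conn.request("POST", "/v2/snaps/%s" % name, body=json.dumps(body),
                 headers={"Content-Type": "application/json"})
    response = conn.getresponse()
    return json.loads(response.read())


if __name__ == "__main__":
    print(refresh_snap("pc-kernel"))
```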

By the way, I do think the private snap store / brand store is a viable solution for our problem. I think it is a relatively new offering because I cannot find much information about it, including pricing.

1 Like

You definitely can; the Canonical QA team is exactly interested in avoiding any kind of regression and will happily hold back an update to stable (unless it is a serious security fix, in which case they will consult the security team and the reporter of the issue about it). If we offer a kernel snap that is supposed to support your HW, we never, ever want to release it with regressions …

OTOH there is only a limited set of hardware to test on, so feedback from funny guys like you :slight_smile: who use a hardware setup not included in the current test process is essential … I’m sure @fgimenez (as an important person in our QA and release process) agrees with that.

1 Like

Doubling down on what @ogra said, we did hold back updates before for this exact reason. This was part of why 2.25 never made it into stable, for example. So if you have serious breakages, by all means please report them and we’ll hold the update back.

1 Like

I’d like to reiterate that - this functionality is absolutely necessary. When decision-making is forcefully taken away and there is no ability to roll back on your own, people will eventually either work around it or stop using the package format; if their job depends on it (they have SLAs with their clients), or if somebody’s life depends on some software staying up when it is told to stay up, there won’t be much room for discussion.

There are classes of software that do not work well with automatic upgrades.

  • One cannot just auto-upgrade a virtual machine process (QEMU) - security updates do come out, but you cannot simply kill the running VMs, and they don’t exit on a timeout;
  • One cannot auto-upgrade an almost-dead storage cluster or a database: what if my cluster is in a state where a single service failure would completely ruin/corrupt the whole cluster? You cannot solve certain issues before “11 PM”;
  • Sometimes even patch versions are not considered safe to apply right away: https://www.rabbitmq.com/clustering.html#upgrading “This will generally not be the case when upgrading from one patch version to another (i.e. from 3.0.x to 3.0.y), except when indicated otherwise in the release notes; these versions can be mixed in a cluster. Therefore, it is strongly recommended to consult release notes before upgrading.” What if the vendor’s tests and QA miss that requirement from a third-party dependency? Who is responsible if your whole production cluster goes down, or is corrupted in such a way that the issue only fires a month later?
  • RTOS kernels and the software running on them. Cars have hypervisors nowadays. They also have mobile internet connections. Certain types of software may come into play (like visual assistance, cruise control, etc.) that require safety guarantees. I would very much like a car to ask me if I’d like to upgrade before it does it by itself.

I am not against “upgrade by default”, but in certain cases you have to offload the final decision to some other system. It may be a human, an AI, or an automatic or automated system that will calculate whether it is safe to proceed based on certain criteria. I hope the examples above illustrate this well enough.

If anybody here is familiar with Control Theory, not having a hook to block upgrades is the same as removing the feedback loop from this diagram: https://en.wikipedia.org/wiki/Feedback#/media/File:Set-point_control.png

Other considerations:

  • how do I do Blue/Green or Red/Black deployments for patch versions if my software vendor does not provide tracks for patch versions?
  • what if I don’t trust my software vendor by default and run everything through a staging environment as an enterprise?
  • what if I am a casual user and I got a snap from some author that I don’t really trust? What if I later learn that he got hacked and attackers have silently pushed a backdoor to all systems where that snap is installed? What if I know about this right away but I have no access to my system to mitigate the issue before it’s too late?

To summarize: if this mechanism is to be generic for all kinds of software, it needs to provide more options to control snap distribution and upgrades at the snapd and snap package levels, otherwise it is not generic enough.

My personal advice is to reconsider your position and add such functionality at the snapd level. A brand store will still be needed even if a snapd-level switch is introduced - I don’t think people generally invest time in such infrastructure unless they have unlimited resources, so it’s a good idea to have that on offer.

6 Likes

Let me add another scenario for you to be aware of, and another possible resolution. About a month ago I added OSSEC (a HIDS) to a restricted environment I support. Shortly thereafter I got a notification that snapd couldn’t contact its server. The reason is that the standards for this particular environment require that outbound access be restricted to only business-required processes. Snapd was being blocked by the firewall because I didn’t even know it “called home” (I didn’t even know what snapd was until about two months prior to that).

Environments like this (HIPAA, FINRA?, PCI, Sarbanes-Oxley?, this European requirement, etc.) require control. Any software update in the environment is a significant event. However, other requirements state that software must be kept current from a security perspective, so it’s a double bind. Pragmatically speaking, even if I opened the firewall for snapd, the overhead of reviewing all the HIDS alerts I could get would be prohibitive. “Control” is the key operative word.

Another alternative to consider for addressing this would be a rating system for upgrades, from “Do it or die” (critical security flaw) down to “It would be nice when/if you get around to it”. Software providers give each update a rating; consumers set the level allowed for automatic updates.

With all the above, I will say that you may have to come to a point where you say “Your needs and our solution are not sufficiently compatible, you need to forego using it”. As long as snapd doesn’t become the next systemd and users have a choice, that may be the fallback resolution of last resort.

5 Likes

Is that ‘some point’ soon, or can you still introduce features to mitigate everyone’s concerns above?

1 Like

You know, it would not be that hard for somebody to fork snapd and develop a patch that disables auto-updates. You would have to clone the code, apply the patch, and rebuild - theoretically possible, though not advised.

1 Like

I was browsing through a Fork I did, and found something interesting…

https://github.com/gjsman/snapd/blob/master/data/systemd/snapd.refresh.timer

AFAIK the snapd.refresh.timer unit isn’t used anymore and snapd does the scheduling now internally.

I’ve been researching snapd as an option for improving the software packaging, delivery, and installation in a set of projects. While trying to better understand how Snaps work, I came up with a few questions and concerns, and that eventually brought me to this discussion.

Unfortunately, that’s also led me to conclude that snapd is currently not suitable for my use. I’m looking at servers with 24/7 production expectations and revenue tied to their availability. We have specific internal processes governing how and when updates can be applied and what review they have to go through.

The lack of ability to fully control and disable automatic updates from the client-side makes snapd a non-starter for my needs. I appreciate your desire to keep systems secure and up-to-date, but in the Enterprise world where I live, I’m responsible for the systems, so I need to have full control over the update process for everything on the box, period. To put it bluntly, they’re my servers; I’m the one that will get paged when there are issues with them, and I’m the one who is responsible both for their uptime, and for keeping them up-to-date.

What makes sense on a desktop or for casual end users just doesn’t work in my world. A download-then-wait-for-approval-to-install model would probably suffice for most of my requirements, but any sort of automatic installation that can’t be disabled and manually controlled is a deal-breaker. If I brought up unattended updates on our production systems during our change control meetings, I’d literally get laughed at.

I’ll try to continue following this discussion. snapd looks like a great fit for some of my use cases, and like something that could be hugely beneficial to me, if and when it adds the basic functionality that I require for my environment. Until then, I’ll keep evaluating alternatives.

8 Likes