Epochs (stepped upgrades)

  1. Indeed, as it would mean the snap is incompatible with itself.

  2. I suggest ignoring unknowns and letting snapcraft catch such issues and report them at build time. The problem otherwise is that there would be no way, in the future, to build a snap that accepts more data without making it incompatible with older snapds. On the other hand, we can always force a future snap not to install on older snapds via the assumes mechanism.

  3. I wasn’t so concerned about the format of the ints used, but your second example is a bit of an eye-opener, as it looks good and is wrong. As it’s cheap for us, it might be worth forcing people to use plain ints.

  4. No, that’s the proposed rule number 1 above, and it’s also the reason why in your first question you bring up sharing at least one value and not all of them.


Another question: should we accept read: 5 as shorthand for read: [5]?

Doesn’t seem worth it. We have a short syntax that should cover the majority of simple cases; that will most likely be simply epoch: 5 alone. Once people decide to skip it and go for the full syntax, we can ask for more verbosity and precision.
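For illustration, here's a minimal sketch in Go (using gopkg.in/yaml.v2) of how a parser could accept both the short form and the full form. The Epoch struct and its field names are hypothetical, not snapd's actual types:

    package main

    import (
        "fmt"

        "gopkg.in/yaml.v2"
    )

    // Epoch holds the read/write capability lists (hypothetical layout).
    type Epoch struct {
        Read  []int `yaml:"read"`
        Write []int `yaml:"write"`
    }

    // UnmarshalYAML accepts either a bare integer (the short form) or the
    // full mapping with explicit read/write lists.
    func (e *Epoch) UnmarshalYAML(unmarshal func(interface{}) error) error {
        var short int
        if err := unmarshal(&short); err == nil {
            e.Read = []int{short}
            e.Write = []int{short}
            return nil
        }
        type full Epoch // local type without the method, to avoid recursion
        var f full
        if err := unmarshal(&f); err != nil {
            return err
        }
        *e = Epoch(f)
        return nil
    }

    func main() {
        for _, doc := range []string{"epoch: 5", "epoch: {read: [1, 2], write: [2]}"} {
            var s struct{ Epoch Epoch }
            if err := yaml.Unmarshal([]byte(doc), &s); err != nil {
                fmt.Println("error:", err)
                continue
            }
            fmt.Printf("%q -> %+v\n", doc, s.Epoch)
        }
    }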


The flip side of this is that 010 in YAML would typically be 8, but if we define epochs as decimal numbers then it’s a 10, and this will surprise a different subset of developers.

If we disallow these forms, we need to reject them outright rather than parse them differently.
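If we go the rejection route, a minimal sketch of that check in Go (the function name and the regexp approach are mine for illustration, not an existing snapcraft/snapd API):

    package main

    import (
        "fmt"
        "regexp"
    )

    // plainInt matches "0" or a digit sequence with no leading zero: no
    // signs, no octal-looking 010, no hex, no floats.
    var plainInt = regexp.MustCompile(`^(0|[1-9][0-9]*)$`)

    func validateEpochValue(raw string) error {
        if !plainInt.MatchString(raw) {
            return fmt.Errorf("epoch values must be plain decimal integers, got %q", raw)
        }
        return nil
    }

    func main() {
        for _, raw := range []string{"5", "010", "0x1f", "-1"} {
            fmt.Printf("%-4s -> %v\n", raw, validateEpochValue(raw))
        }
    }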

The new complex Epoch structure (two lists, for read and write capabilities) will be used in the new APIs being discussed here.

In the old API (the one used by current and previous snapds), we’ll do the following:

  • keep the epoch field as a string, both in the request sent from the client to the server and in the server responses

  • refreshes will filter by epoch simply by returning the latest revision with the same epoch the client indicated (no “epoch evolution” via the old API)
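As a rough sketch of that old-API behaviour in Go (types and names are illustrative only): the epoch travels as a plain string, and a refresh just returns the latest revision whose epoch string equals the client’s.

    package main

    import "fmt"

    type release struct {
        revision int
        epoch    string
    }

    // latestWithEpoch scans the channel history (in release order) and keeps
    // the most recent revision carrying exactly the epoch the client sent.
    func latestWithEpoch(history []release, clientEpoch string) (release, bool) {
        var found release
        ok := false
        for _, r := range history {
            if r.epoch == clientEpoch {
                found, ok = r, true
            }
        }
        return found, ok
    }

    func main() {
        history := []release{{1, "0"}, {2, "0"}, {3, "1"}}
        fmt.Println(latestWithEpoch(history, "0")) // {2 0} true
    }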

For the record, the issue that motivates the design @facundobatista points out is that old snapds cannot handle rich epochs, and do not understand the epoch semantics at all, so the suggested behavior ensures an old snapd will remain inside its comfort zone, so to speak.

While defining the work required for epochs, we noted that epoch filtering introduces a new concept we haven’t needed before. When resolving which revision should be returned for a snap refresh, we need to pick the next best in its epoch upgrade path. And for that, we need to go through the timeline of releases in the channel the client is tracking. But what is a channel timeline?

There are different possible interpretations for it, but let’s consider the one we are thinking of applying:

  • While the channel is open and getting releases, the timeline seems clear: all revisions released to that channel over time.
  • On the other hand, if a channel is closed and follows another, its history is combined with the other channel’s history while it remains closed; if the channel is reopened later, it stops sharing the other channel’s history from that point on, but it keeps, as part of its own timeline, the releases made to the other channel while it was following it.

This means that a refresh could return a revision that was released to a different channel in the past (when gathering the revisions to which we would apply epoch filtering).

For example, given the following sequence of releases:

Release r1 to edge
Release r2 to beta
Release r3 to beta
Close edge
Release r4 to beta
Release r5 to beta
Release r6 to edge (ie. reopen edge)
Release r7 to beta

The timeline for edge at this point would be:
r1 -> r3 -> r4 -> r5 -> r6
(note how we are including releases to beta while edge was closed and following beta)
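To make the rule concrete, here's a hedged sketch in Go that replays the releases above and reproduces that edge timeline. It assumes a single level of following (edge follows beta while closed, and picks up beta's tip at the moment it closes); all type and function names are illustrative:

    package main

    import "fmt"

    type event struct {
        kind     string // "release" or "close"
        channel  string
        revision string // set for "release" events
    }

    // timeline replays the events and returns the revisions visible on the
    // given channel, including releases to the followed channel while this
    // channel was closed.
    func timeline(events []event, channel, follows string) []string {
        closed := false
        tip := map[string]string{} // current tip per channel
        var revs []string
        for _, ev := range events {
            switch ev.kind {
            case "close":
                if ev.channel == channel {
                    closed = true
                    // the followed channel's current tip becomes visible
                    if r, ok := tip[follows]; ok {
                        revs = append(revs, r)
                    }
                }
            case "release":
                tip[ev.channel] = ev.revision
                if ev.channel == channel {
                    closed = false // a release reopens the channel
                    revs = append(revs, ev.revision)
                } else if closed && ev.channel == follows {
                    revs = append(revs, ev.revision)
                }
            }
        }
        return revs
    }

    func main() {
        events := []event{
            {"release", "edge", "r1"},
            {"release", "beta", "r2"},
            {"release", "beta", "r3"},
            {"close", "edge", ""},
            {"release", "beta", "r4"},
            {"release", "beta", "r5"},
            {"release", "edge", "r6"}, // reopens edge
            {"release", "beta", "r7"},
        }
        fmt.Println(timeline(events, "edge", "beta")) // [r1 r3 r4 r5 r6]
    }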

If we extend the example above by introducing epochs:

    read    write
r1  [1]     [1]
r2  [1]     [1]
r3  [1,2]   [1,2]
r4  [1,2]   [1,2]
r5  [2]     [2]
r6  [1]     [1]
r7  [2,3]   [2,3]

Assuming a client tracking the edge channel refreshes from revision r1 and we follow the described timeline, it would get r4 as a result.
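Continuing the sketch, this is how that result falls out: filter the timeline down to revisions whose read list covers the client's write epoch, then apply the "greatest epoch wins, latest release breaks ties" behaviour defined earlier in the thread (again, all names are illustrative):

    package main

    import "fmt"

    type rev struct {
        name        string
        read, write []int
    }

    func contains(xs []int, v int) bool {
        for _, x := range xs {
            if x == v {
                return true
            }
        }
        return false
    }

    func maxOf(xs []int) int {
        m := xs[0]
        for _, x := range xs[1:] {
            if x > m {
                m = x
            }
        }
        return m
    }

    func main() {
        // The edge timeline from the example, with the epoch lists above.
        timeline := []rev{
            {"r1", []int{1}, []int{1}},
            {"r3", []int{1, 2}, []int{1, 2}},
            {"r4", []int{1, 2}, []int{1, 2}},
            {"r5", []int{2}, []int{2}},
            {"r6", []int{1}, []int{1}},
        }
        clientEpoch := maxOf(timeline[0].write) // the client sits on r1: data at epoch 1

        best := -1
        for i, r := range timeline {
            if !contains(r.read, clientEpoch) {
                continue // r5 drops out: it cannot read epoch-1 data
            }
            // greatest epoch wins; a later release breaks ties
            if best == -1 || maxOf(r.read) >= maxOf(timeline[best].read) {
                best = i
            }
        }
        fmt.Println("refresh target:", timeline[best].name) // r4
    }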

The timeline definition will decide how to resolve epoch filtering when one or more closed channels are involved, and that’s why we want to agree on what it means before getting to the implementation.

It feels like we are making this too complicated. I understand the desire to recreate history exactly as it happened, but don’t feel it is necessary in this case.

The intent of epochs is to get you from point A to D by way of B and C if necessary, not to follow an explicit trail of A->B->C->D. If C was only released to a followed channel but the path A->B->D is valid, we are OK*. The problem only arises if an epoch-compatible revision was never released in the given channel, but I’d argue that’s an oddity in the publishing sequence.

Simply following the explicit list of revisions released to a channel (and ignoring following) feels a lot clearer to me from a publisher viewpoint, and it’s not a difference an end user will ever see.

*To clarify, in my example above I’m assuming we are on edge; A, B, and D were released to edge; C was only released to beta; and edge was closed at some point while C was in beta.

@noise That’s not really what we are aiming for, at least not without being more specific. We do care about the sequence and the history, including the intermediate bits. That’s what gives people the ability to control exactly which release will be visible when jumping through an epoch, and all intermediate epochs that were published into the channel should be respected as well.

There are indeed a lot of details above about epochs, and they remain relevant. Simplicity here can’t break the rules we established above and explained the rationale for.

With that said, I’m out of context, so let me read @matiasb’s detailed coverage of the issue and try to understand the problem to be solved.

Right, sorry I wasn’t clear: I wasn’t saying to ignore the epoch chain, just to ignore followed revisions from another channel while gathering the links.

@matiasb Ah, that’s an interesting edge case indeed, and it creates a pretty convoluted picture once we account for the fact that we support multiple levels of follow-ups.

The best possible solution would be to reflect exactly what a client would see in reality. In the example, closing edge while it was holding r1 is, from a client’s perspective, exactly as if edge had observed r3 being released into it, then r4 and r5. Then r6 gets back into edge, and that indeed forms the history you described:

r1 -> r3 -> r4 -> r5 -> r6

That means no matter when one got to follow the edge channel, they’ll observe exactly the same history. Good invariant to hold.

That sounds right. I expect this will make the implementation a bit more complex, but the payoff is that people won’t have to think about and fix awkward situations by hand.

While working out the best behaviour for release history across channels, we found that we need to better understand which behaviour we want when selecting the revision to return when multiple revisions are available, even within a single channel.

IOW, when multiple revisions are acceptable after applying the epoch restriction, which one should the server choose?

There are three options:

  1. the most recently released revision with the matching epoch wins
  2. the most recently released revision with the matching epoch and the greatest other epoch wins
  3. the most recently released revision with the matching epoch and at least one greater epoch wins

The first option is the simplest one: of all the revisions that are epoch-compatible, just take the latest one. This option is not only simple but also in line with the current “latest is what gets returned” pre-epochs behaviour, so it’s easy to think about.

However, this one is unworkable. Say we have the following releases (assuming the same epochs in read and write, for simplicity):

r1    [0, 1]
r2    [1, 2]
r3    [2, 3]

If we boot a device at epoch 0 for the first time, it will refresh to r1, then to r2, and finally to r3. At some point we find that r1 has a bug, so we fix it and release the corrected code as r4 (of course with epoch [0, 1]). If we now boot another epoch-0 device for the first time, it will refresh to r4 and never move on to r2 or r3, getting stuck.

The second option is more complex but avoids the problem just described. It’s the currently defined behaviour (per earlier discussion in this thread). The downside of this approach is that the revision selection is rigid. In this example we have:

r1    [0, 1]
r2    [1, 2, 3]
r3    [1, 2]

When a device with epoch 0 refreshes, it will go to r1, and then from r1 to r2 (as r2 has the greatest epoch!). No matter what revisions we release in the future, the only way for the device to upgrade to something different is to release a revision with a number greater than or equal to 3 in its epoch list.

The third option is slightly different: a revision may be selected even if it doesn’t have the greatest epoch, because having at least one epoch greater than the matching one is enough to establish an upgrade path. See the following example:

r1    [0, 1]
r2    [1, 2, 3]
r3    [1, 2]
r4    [0, 1]

In this case the device bootstrapped with epoch 0 will also end up at r2, but through a longer sequence of refreshes. It first goes to r4 (not r1, because r4 has a higher epoch (1) than the matching one (0) and was released later than r1). In the next refresh it goes to r3, which beats r4 by having one epoch higher than the matching one (1), and beats r2 (which also has higher epochs) by being released later. Finally it refreshes to r2, because it is now the only one with epochs exceeding the matching ones.
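To compare the three options side by side, here's a hedged simulation in Go of the selection rules and the refresh paths they produce for the examples above. The rule encodings are my reading of the proposals, not an existing implementation; read and write lists are assumed equal, and a device's data epoch after installing a revision is taken to be the greatest epoch in that revision's list:

    package main

    import "fmt"

    type revision struct {
        name   string
        epochs []int // read == write, as in the examples
    }

    func contains(xs []int, v int) bool {
        for _, x := range xs {
            if x == v {
                return true
            }
        }
        return false
    }

    func maxEpoch(xs []int) int {
        m := xs[0]
        for _, x := range xs[1:] {
            if x > m {
                m = x
            }
        }
        return m
    }

    // pick applies one of the three rules to the history (in release order)
    // and the device's current data epoch, returning the index of the
    // winning revision or -1 if none matches.
    func pick(history []revision, dataEpoch, option int) int {
        best := -1
        for i, r := range history {
            if !contains(r.epochs, dataEpoch) {
                continue
            }
            switch option {
            case 1: // the latest release that can read the data
                best = i
            case 2: // the greatest epoch wins; a later release breaks ties
                if best == -1 || maxEpoch(r.epochs) >= maxEpoch(history[best].epochs) {
                    best = i
                }
            case 3: // the latest release with at least one epoch beyond the data
                if maxEpoch(r.epochs) > dataEpoch {
                    best = i
                }
            }
        }
        return best
    }

    // path simulates successive refreshes of a fresh device at epoch 0.
    func path(history []revision, option int) []string {
        epoch, current, out := 0, -1, []string{}
        for {
            next := pick(history, epoch, option)
            if next == -1 || next == current {
                return out
            }
            current = next
            epoch = maxEpoch(history[next].epochs)
            out = append(out, history[next].name)
        }
    }

    func main() {
        ex1 := []revision{ // the first example (r4 is the late bugfix)
            {"r1", []int{0, 1}}, {"r2", []int{1, 2}},
            {"r3", []int{2, 3}}, {"r4", []int{0, 1}},
        }
        ex2 := []revision{ // the second example
            {"r1", []int{0, 1}}, {"r2", []int{1, 2, 3}}, {"r3", []int{1, 2}},
        }
        ex3 := []revision{ // the third example
            {"r1", []int{0, 1}}, {"r2", []int{1, 2, 3}},
            {"r3", []int{1, 2}}, {"r4", []int{0, 1}},
        }
        fmt.Println("option 1:", path(ex1, 1)) // [r4]: stuck, never reaching r2 or r3
        fmt.Println("option 2:", path(ex2, 2)) // [r1 r2]: r2 wins on greatest epoch
        fmt.Println("option 3:", path(ex3, 3)) // [r4 r3 r2]: the longer route described above
    }

Note how, under this reading, option 1 strands the fresh device on r4, option 2 follows r1 -> r2 and then stays put until something ships an epoch of 3 or higher, and option 3 takes the longer r4 -> r3 -> r2 route.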

So, which option do you think is best? Thanks!

Isn’t this described in rule number 2 of the original rule set presented in Aug/2017 above?

  1. The store always offers the most recent snap able to read the highest epoch among those that may be installed (see rule 1 and 6).

This still seems fine for your example. It will install r4, which is the only one that may be installed given we still have data at epoch 0, then r2, which can now be installed and is able to read epoch 2, and then r3, which may now be installed and can read epoch 3.

That case as specified in the original ruleset has the flaw that it’s possible to make a mistake from which it isn’t possible to sensibly recover. If I release a revision for [2, 3, 4] and later realise I need separate [2, 3] and [3, 4] revisions, the [2, 3] revision will never be picked, as the ancient [2, 3, 4] revision takes precedence. We don’t have other situations today where you can get yourself into a corner like that.

One can always unrelease the snap if it was an actual mistake, which takes them out of the corner. When it is not a mistake, though, the logic seems right: the [1, 2, 3] is indeed preferable to something that can only read [1, 2]. The former allows a revert (downgrade), while the latter doesn’t. It’s also just more intuitive… the data format rolls forward, and hop points in the middle will need to support more than one format on the way to the current tip that is higher up. The ones with more compatibility are better hop points than the ones with less compatibility. Both [1, 2, 3] and [2, 3], and even [3], look to me like more recent epochs than [1, 2], and thus a more natural target.

When all revisions are of equal quality, the one that gets the upgrade further along is indeed the better choice. But in the case where I have a [2, 3, 4] revision and realise the direct 2 -> 4 handling is buggy, can’t sensibly be fixed in that new version of the software, and so needs splitting into [2, 3] and [3, 4], the one with the greatest epoch is no longer the best choice. It’s also strange that releasing the new [2, 3] revision would appear to work but actually have no effect on anything.

Unreleasing would be one way to solve the problem, but that’s not currently a concept that exists, partly because it’s not clear how it should interact with channel history, particularly when it comes to gating and closed channels.

Well, it needs to exist no matter what we agree on here. People need to be able to pull things that were released by mistake, for other reasons too. My understanding was that this was already possible. If it’s not, let’s please put it on the agenda.

Unreleasing hasn’t been interesting so far; only the tip of a channel is significant today, so you can just re-release the revision that you want clients to have. But the ability to unrelease becomes important once epochs cause us to start considering non-tip revisions, so we’ll put it on the list to discuss at the next sprint.

We’ll proceed with implementing option 2 (greatest epoch wins) for now. There are still some unresolved issues around what happens when a channel is closed, but they’re not blockers for the next stages of the server-side implementation.

Even without epochs that’s already not true. There are ways people can artificially lock a revision into availability, as long as it has been released publicly and not unreleased. For example, the proxy can tune revisions based on that public history, and refresh-control can be used to do the same.