Epochs (stepped upgrades)

This might be a catalyst to get us to refactor the metadata endpoint (create a new endpoint or two) to accept more context and handle all install, refresh, and refresh-with-revision cases. IMHO /details should only be used for snap info.


@pedronis @noise unless I’m missing something, the new endpoints would avoid the client downloading a particular revision the user explicitly requested, only to realise it’s not possible to install it, right? In other words it would make the error synchronous instead of asynchronous, which is a good thing, but could be seen as a perf tweak?

Or are there cases that should behave differently than what we’re going to get with the current endpoints, considering the client is going to be checking the rules before installing anything?

Some considerations:

  • epoch numbers are always going to be non-negative
  • 0* is an invalid epoch expression and will throw an error
  • as will trying to use hex or octal (or roman numerals etc) for the epoch expression
  • an empty list in the expanded version is also an error
  • as is a list that isn’t in ascending order

and three questions:

  1. it’s an error if read and write do not share at least one value, right?
  2. Is it an error if the epoch includes random other stuff? E.g.
    epoch:
      does-not-rhyme-with: epic
    
  3. the lists are yaml numbers, so hex and such will pass. Or should I force them to be decimal? It’ll mean more work both in the unmarshaler and in documenting the fact, but it means we’ll never have to deal with developers making things harder for themselves by doing
    epoch:
     read: [0xC]
     write: [11]
    
    or
    epoch:
     read:
      - 07
      - 08
      - 09
      - 10
    

A different kind of question: if I have a snap installed, to upgrade do I need to be able to read all the epochs the current snap writes?

  1. Indeed, as it would mean the snap is incompatible with itself.

  2. I suggest ignoring unknowns and letting snapcraft catch such issues and report at build time. The problem otherwise is that there’s no way to build in the future a snap that accepts more data without it being incompatible with older snapds. On the other hand, we can always force in the future to have a snap that doesn’t install in the past via the assumes mechanism.

  3. I wasn’t so concerned about the format of ints used, but your second example is a bit of an eye-opener, as it looks good and is wrong. As it’s cheap for us, it might be worth forcing people to use plain ints.

  4. No, that’s the proposed rule number 1 above, and it’s also the reason why in your first question you bring up sharing at least one value and not all of them.
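
Putting the considerations and answers above together, here is a minimal validation sketch, assuming a hypothetical Epoch type holding the expanded read/write lists (the 0* and hex/octal cases concern the YAML surface syntax and would be rejected while unmarshalling); illustrative only, not snapd code:

    package main

    import (
        "errors"
        "fmt"
    )

    type Epoch struct {
        Read  []uint32 // non-negative by construction
        Write []uint32
    }

    func checkList(name string, xs []uint32) error {
        if len(xs) == 0 {
            return fmt.Errorf("epoch %s list cannot be empty", name)
        }
        for i := 1; i < len(xs); i++ {
            if xs[i] <= xs[i-1] { // treat duplicates as out of order too
                return fmt.Errorf("epoch %s list must be in ascending order", name)
            }
        }
        return nil
    }

    // Validate enforces: non-empty, ascending lists, and at least one value
    // shared between read and write (answer 1 above).
    func (e Epoch) Validate() error {
        if err := checkList("read", e.Read); err != nil {
            return err
        }
        if err := checkList("write", e.Write); err != nil {
            return err
        }
        for _, w := range e.Write {
            for _, r := range e.Read {
                if w == r {
                    return nil
                }
            }
        }
        return errors.New("epoch read and write lists must share at least one value")
    }

    func main() {
        fmt.Println(Epoch{Read: []uint32{1, 2}, Write: []uint32{2}}.Validate()) // <nil>
        fmt.Println(Epoch{Read: []uint32{2, 1}, Write: []uint32{2}}.Validate()) // not ascending
        fmt.Println(Epoch{Read: []uint32{1}, Write: []uint32{2}}.Validate())    // no shared value
    }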


another question: should we accept read: 5 as shorthand for read: [5]?

Doesn’t seem worth it. We have a short syntax that should cover the majority of simple cases. That for example will most likely be simply epoch: 5 alone. Once people decide to skip it and go full syntax, we can ask for more verbosity and precision.


The obverse of this is that 010 in yaml would typically be 8, but if we define them as decimal numbers then it’s a 10, and this will surprise a different subset of developers.

If we disallow these, we need to reject them rather than parse them differently.
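
If we do force plain decimals, one cheap way to reject such values outright, assuming we get to look at the raw YAML scalar text before it is converted to an int (a sketch, not snapd code):

    package main

    import (
        "fmt"
        "regexp"
        "strconv"
    )

    // only plain decimal literals are accepted, so "0xC", "07" and "010"
    // fail loudly instead of being parsed as hex, octal, or a surprising decimal
    var plainDecimal = regexp.MustCompile(`^(0|[1-9][0-9]*)$`)

    func parseEpochValue(raw string) (uint32, error) {
        if !plainDecimal.MatchString(raw) {
            return 0, fmt.Errorf("epoch values must be plain decimal integers (got %q)", raw)
        }
        n, err := strconv.ParseUint(raw, 10, 32)
        if err != nil {
            return 0, err
        }
        return uint32(n), nil
    }

    func main() {
        for _, raw := range []string{"10", "0xC", "010"} {
            v, err := parseEpochValue(raw)
            fmt.Println(raw, "->", v, err)
        }
    }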

The new complex Epoch structure (two lists, for read and write capabilities) will be used in the new APIs that are being discussed here.

In the old API (the one used by current and previous snapds), we’ll do the following:

  • keep the epoch field as a string, both in the request sent from the client to the server and in the server responses

  • refreshes will only filter by epoch by returning the latest revision with the same epoch the client indicated (no “epoch evolution” using the old API)
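
A minimal sketch of that old-API filtering, with made-up types rather than actual store code: the epoch travels as a plain string, and the server simply returns the most recently released revision declaring exactly the epoch the client reported.

    package main

    import "fmt"

    type oldRevision struct {
        name  string
        epoch string // epochs stay plain strings in the old API
    }

    // slice order stands in for release order (later means more recent)
    func latestWithSameEpoch(revs []oldRevision, clientEpoch string) (best oldRevision, ok bool) {
        for _, r := range revs {
            if r.epoch == clientEpoch {
                best, ok = r, true
            }
        }
        return
    }

    func main() {
        revs := []oldRevision{{"r1", "0"}, {"r2", "1"}, {"r3", "0"}}
        r, _ := latestWithSameEpoch(revs, "0")
        fmt.Println(r.name) // r3: latest with the same epoch, no epoch evolution
    }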

For the record, the issue that motivates the design @facundobatista points out is that old snapds cannot handle rich epochs, and do not understand the epoch semantics at all, so the suggested behavior ensures an old snapd will remain inside its comfort zone, so to speak.

While defining the work required for epochs, we noted that epoch filtering introduces a new concept we haven’t needed before. When resolving which revision should be returned for a snap refresh, we need to pick the next best in its epoch upgrade path. And for that, we need to go through the timeline of releases in the channel the client is tracking. But what is a channel timeline?

There are different possible interpretations for it, but let’s consider the one we are thinking of applying:

  • While the channel is open and getting releases, the timeline seems clear: all revisions released to that channel over time.
  • On the other hand, if a channel is closed and follows another, its history is combined with the history of that other channel while it is closed; if the channel is reopened later, it stops sharing the other channel’s history from that point on, but it keeps, as part of its timeline, the releases made to that other channel while it was following it.

This means that a refresh could return a revision that was released to a different channel in the past (when gathering the revisions to which we would apply epoch filtering).

For example, given the following sequence of releases:

Release r1 to edge
Release r2 to beta
Release r3 to beta
Close edge
Release r4 to beta
Release r5 to beta
Release r6 to edge (ie. reopen edge)
Release r7 to beta

The timeline for edge at this point would be:
r1 -> r3 -> r4 -> r5 -> r6
(note how we are including releases to beta while edge was closed and following beta)
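
A rough sketch of this timeline interpretation, assuming a single level of following (a closed edge follows beta) and made-up event types; illustrative only, not store code:

    package main

    import "fmt"

    type event struct {
        kind     string // "release" or "close"
        revision string // set for "release" events
        channel  string
    }

    func timelines(events []event, follows map[string]string) map[string][]string {
        tl := map[string][]string{}  // per-channel timeline
        head := map[string]string{}  // latest revision held by each channel
        closed := map[string]bool{}  // whether a channel is currently closed

        for _, ev := range events {
            switch ev.kind {
            case "release":
                ch := ev.channel
                closed[ch] = false // releasing to a closed channel reopens it
                tl[ch] = append(tl[ch], ev.revision)
                head[ch] = ev.revision
                // closed channels following ch also observe this release
                for c, f := range follows {
                    if closed[c] && f == ch {
                        tl[c] = append(tl[c], ev.revision)
                    }
                }
            case "close":
                ch := ev.channel
                closed[ch] = true
                // closing is equivalent to observing the followed channel's
                // current revision being released into this channel
                if f, ok := follows[ch]; ok {
                    if h := head[f]; h != "" && h != head[ch] {
                        tl[ch] = append(tl[ch], h)
                    }
                }
            }
        }
        return tl
    }

    func main() {
        events := []event{
            {"release", "r1", "edge"},
            {"release", "r2", "beta"},
            {"release", "r3", "beta"},
            {"close", "", "edge"},
            {"release", "r4", "beta"},
            {"release", "r5", "beta"},
            {"release", "r6", "edge"}, // reopens edge
            {"release", "r7", "beta"},
        }
        fmt.Println(timelines(events, map[string]string{"edge": "beta"})["edge"])
    }

Running it prints [r1 r3 r4 r5 r6] for edge: r3 enters the timeline when edge is closed (it is what beta holds at that moment), r4 and r5 arrive while edge is following beta, and r7 is excluded because edge had already been reopened.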

If we extend the example above introducing epochs:

    read    write
r1  [1]     [1]
r2  [1]     [1]
r3  [1,2]   [1,2]
r4  [1,2]   [1,2]
r5  [2]     [2]
r6  [1]     [1]
r7  [2,3]   [2,3]

Assuming a client tracking the edge channel does a refresh from revision r1, then following the described timeline it would get r4 as a result.
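
For concreteness, here is a small sketch that reproduces the r4 result over that timeline, assuming a selection rule along the lines of the original rule set quoted further down (the most recent revision able to read the highest epoch, among those able to read what the installed revision writes); types and names are made up:

    package main

    import "fmt"

    type rev struct {
        name        string
        read, write []int
    }

    func containsEpoch(xs []int, v int) bool {
        for _, x := range xs {
            if x == v {
                return true
            }
        }
        return false
    }

    func maxEpoch(xs []int) int {
        m := xs[0]
        for _, x := range xs {
            if x > m {
                m = x
            }
        }
        return m
    }

    // refreshTarget walks the channel timeline (slice order == release order),
    // keeping only revisions able to read the installed revision's write epoch,
    // and prefers higher readable epochs, breaking ties by later release.
    func refreshTarget(timeline []rev, installed rev) (best rev, ok bool) {
        epoch := maxEpoch(installed.write)
        for _, r := range timeline {
            if !containsEpoch(r.read, epoch) {
                continue
            }
            if !ok || maxEpoch(r.read) >= maxEpoch(best.read) {
                best, ok = r, true
            }
        }
        return
    }

    func main() {
        edge := []rev{
            {"r1", []int{1}, []int{1}},
            {"r3", []int{1, 2}, []int{1, 2}},
            {"r4", []int{1, 2}, []int{1, 2}},
            {"r5", []int{2}, []int{2}},
            {"r6", []int{1}, []int{1}},
        }
        r, _ := refreshTarget(edge, edge[0])
        fmt.Println(r.name) // r4
    }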

The timeline definition will decide how to resolve epoch filtering when one (or more) closed channels are involved and that’s why we want to agree on what it means before getting to the implementation.

It feels like we are making this too complicated. I understand the desire to recreate history exactly as it happened, but don’t feel it is necessary in this case.

The intent of epochs is to get you from point A to D by way of B & C if necessary, not to follow an explicit trail of A->B->C->D. If C was only released to a followed channel but the path A->B->D is valid, we are OK*. The problem only arises if an epoch-compatible revision was never released in the given channel, but I’d argue that’s an oddity in the publishing sequence.

Simply following the explicit list of revisions released to a channel (and ignoring following) feels a lot clearer to me from a publisher viewpoint and not a difference that an end user will ever see.

*To clarify, in my example above I’m implying we are on edge; A, B, & D were released to edge; C was only released to beta; and edge was closed at some point while C was in beta.

@noise That’s not really what we are aiming for, at least not without being more specific. We do care about the sequence and the history, including the intermediate bits. That’s what gives people the ability to control exactly which release will be visible when jumping through an epoch, and all intermediate epochs that were published into the channel should be respected as well.

There are indeed a lot of details above about epochs, and they remain relevant. Any simplification here can’t break the rules we established above and whose rationale we explained.

With that said, I’m out of context, so let me read @matiasb’s detailed coverage of the issue and try to understand what the problem to be solved is.

Right, sorry I wasn’t clear: I wasn’t saying to ignore the epoch chain, just to ignore followed revisions from another channel while gathering the links.

@matiasb Ah, that’s an interesting edge case indeed, and it creates a pretty convoluted picture if we account for the fact that we support multiple levels of follow-ups.

The best possible solution would be to reflect exactly what a client would see in reality. In the example, closing edge while it was holding r1 is, from a client’s perspective, exactly equivalent to edge having observed r3 being released into it, and then r4 and r5. Then r6 gets back into edge, and that indeed forms the history you described:

r1 -> r3 -> r4 -> r5 -> r6

That means that no matter when one started following the edge channel, they’ll observe exactly the same history. That’s a good invariant to hold.

That sounds right. I expect this will make the implementation a bit more complex, but the payoff is that people won’t have to think about and fix awkward situations by hand.

While trying to figure out the best behaviour for release history across channels, we found that we need to better understand which behaviour we want when selecting the revision to return when multiple revisions are available, even within the same channel.

IOW, when multiple revisions are OK to be returned after applying the Epochs restriction, which one should the server choose?

There are three options:

  1. the most recently released revision with the epoch wins
  2. the most recently released revision with the epoch and the greatest other epoch wins
  3. the most recently released revision with the epoch and at least one greater epoch wins

The first option is the simplest one. Of all the revisions that are OK epoch-wise, just get the latest one. This option is not only simple but also in line with the current “latest is what will be returned” pre-epochs behaviour, so it’s easy to reason about.

However, this one is unworkable. Say we have the following releases (assuming the same epochs in read and write, for simplicity):

r1    [0, 1]
r2    [1, 2]
r3    [2, 3]

If we boot a device for the first time at epoch 0, it will refresh to r1, then to r2, and finally to r3. At some point we find that r1 has a bug, so we fix it and release the corrected code as r4 (of course with epoch=[0, 1]). If we now boot another device for the first time at epoch 0, it will refresh to r4 and never move on to r2 or r3, getting stuck.

The second option is more complex but avoids the problem just described. It’s the currently defined behaviour (per earlier discussion in this topic). The downside of this approach is that revision selection is rigid. For example, say we have:

r1    [0, 1]
r2    [1, 2, 3]
r3    [1, 2]

When a device with epoch 0 needs to refresh, it will go to r1, and then from r1 to r2 (as it has the greatest epoch!). No matter what revisions we release in the future, the only way for the device to upgrade to something different is to release a revision with a number greater than or equal to 3 in its epoch list.

The third option is slightly different: a revision may be selected even if it doesn’t have the greatest epoch, as long as it has at least one epoch greater than the matching one, which allows an upgrade path to be established. See the following example:

r1    [0, 1]
r2    [1, 2, 3]
r3    [1, 2]
r4    [0, 1]

In this case the device bootstrapped at epoch 0 will also end up at r2, but through a longer sequence of refreshes. It first goes to r4 rather than r1, because r4 has an epoch (1) higher than the matching one (0) and was released later than r1. In the next refresh it goes to r3, which beats r4 because it has an epoch higher than the matching one (1), and beats r2 (which also has higher epochs) because it was released later. And finally it refreshes to r2, because it’s now the only one with epochs exceeding the matching ones.
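
To make the three options easier to compare, here is a rough sketch under simplifying assumptions: read equals write for every revision, the device’s current epoch is the highest epoch its installed revision writes, and slice order stands in for release order. Names are made up; this is not store code.

    package main

    import "fmt"

    type revision struct {
        name string
        read []int // read == write in these examples
    }

    func contains(xs []int, v int) bool {
        for _, x := range xs {
            if x == v {
                return true
            }
        }
        return false
    }

    func maxEpoch(xs []int) int {
        m := xs[0]
        for _, x := range xs {
            if x > m {
                m = x
            }
        }
        return m
    }

    // option 1: the most recently released revision reading the current epoch
    func pick1(revs []revision, epoch int) (best revision, ok bool) {
        for _, r := range revs {
            if contains(r.read, epoch) {
                best, ok = r, true
            }
        }
        return
    }

    // option 2: among revisions reading the current epoch, the greatest maximum
    // epoch wins; ties go to the most recently released revision
    func pick2(revs []revision, epoch int) (best revision, ok bool) {
        for _, r := range revs {
            if contains(r.read, epoch) && (!ok || maxEpoch(r.read) >= maxEpoch(best.read)) {
                best, ok = r, true
            }
        }
        return
    }

    // option 3: among revisions reading the current epoch plus at least one
    // greater epoch, the most recently released one wins
    func pick3(revs []revision, epoch int) (best revision, ok bool) {
        for _, r := range revs {
            if contains(r.read, epoch) && maxEpoch(r.read) > epoch {
                best, ok = r, true
            }
        }
        return
    }

    // chain follows successive refreshes, moving the device's epoch to the
    // highest epoch of each newly installed revision, until nothing changes.
    func chain(pick func([]revision, int) (revision, bool), revs []revision, epoch int) (path []string) {
        current := ""
        for {
            r, ok := pick(revs, epoch)
            if !ok || r.name == current {
                return
            }
            path = append(path, r.name)
            current, epoch = r.name, maxEpoch(r.read)
        }
    }

    func main() {
        first := []revision{{"r1", []int{0, 1}}, {"r2", []int{1, 2}}, {"r3", []int{2, 3}}, {"r4", []int{0, 1}}}
        second := []revision{{"r1", []int{0, 1}}, {"r2", []int{1, 2, 3}}, {"r3", []int{1, 2}}}
        third := []revision{{"r1", []int{0, 1}}, {"r2", []int{1, 2, 3}}, {"r3", []int{1, 2}}, {"r4", []int{0, 1}}}

        fmt.Println("option 1:", chain(pick1, first, 0))  // [r4]
        fmt.Println("option 2:", chain(pick2, second, 0)) // [r1 r2]
        fmt.Println("option 3:", chain(pick3, third, 0))  // [r4 r3 r2]
    }

Running it reproduces the behaviours described above: option 1 gets stuck at r4, option 2 goes r1 then r2, and option 3 goes r4, then r3, then r2.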

So, which option do you think is best? Thanks!!

Isn’t this described in rule number 2 of the original rule set presented in Aug/2017 above?

  1. The store always offers the most recent snap able to read the highest epoch among those that may be installed (see rule 1 and 6).

This still seems fine for your example. It will install r4, which is the only one that may be installed given we still have data at epoch 0, then r2, which can now be installed and is able to read epoch 2, and then r3, which may now be installed and can read epoch 3.

That case, as specified in the original ruleset, has the flaw that it’s possible to make a mistake from which it isn’t possible to sensibly recover. If I release a revision for [2, 3, 4] and later realise I need separate [2, 3] and [3, 4] revisions, the [2, 3] revision will never be picked, as the ancient [2, 3, 4] revision will take precedence. We don’t have other situations today where you can get yourself into a corner like that.

One can always unrelease the snap if it was an actual mistake, which takes them out of the corner. When it is not a mistake, though, the logic seems right: a [1, 2, 3] revision is indeed preferable to one that can only read [1, 2]. The former allows a revert (downgrade), while the latter doesn’t. It’s also just more intuitive… the data format rolls forward, and hop points in the middle will need to support more than one format on the way to the current tip that is higher up. The ones with more compatibility are better hop points than the ones with less compatibility. Both [1, 2, 3] and [2, 3], and even [3], look to me like more recent epochs than [1, 2], and thus a more natural target.