Epochs (stepped upgrades)

Recorded a new whiteboard session describing our current plans for the upcoming epochs feature:

https://youtu.be/Q1pyuKYQC-g?width=690&height=388

Please let us know if you have any comments.


Thanks Gustavo for all this info.

I have some questions.

  1. Around 13’09" you show the case where there is no epoch 1*: the user is on v6.0 (e0), and there are also v6.5 (e0) and v7.0 (e1, the one released in the channel). In this case, if the user requests a refresh, will they be left on v6.0, or will they go to 6.5? (Considering that 6.5 was released in the past to the same channel; in other words, it would go to the latest revision in the same epoch that was released to the requested channel.)

  2. Is this “star syntax”, writing epochs as numbers that may be followed by one or two asterisks, the syntax that would be used in Snapcraft’s YAML? Or is it a high-level nomenclature, with the way it is actually written in the project’s config to be defined later?

  3. The way you explained the double-star epoch, it would flag a revision in epoch N as capable of supporting some internal format used in epoch N-1. This is OK, but it’s odd that that version of the software would need to work with two versions of the format at the same time (operating on the old format and the new format simultaneously), since at any time the user could jump back to N-1. I want to be explicit about this case because at first some of us thought of a different behaviour here, which may be more flexible: a single-star revision would support a “forward migration” of internal data, and a double-star revision would support a “backward migration” of it. So when you release 7.5 with epoch 1* (which is capable of migrating the old format to the new one) you would also release 6.7 with epoch 0** (which is capable of migrating from the new format to the old one).

  4. The following are two cases that if I understood correctly should be allowed by the system, having a scenario of 6.0 (e0), 6.5 (e0), 7.5 (e1**), 7.8 (e1):
    a) 6.0 -> 7.5 -> 7.8
    b) 7.8 -> 7.5 -> 6.5
    are these ok?

  5. When jumping back from a double-starred epoch to the previous one, should it just jump to the latest revision of the previous epoch, or is there another rule to consider? For example, if the snap is gated that revision must be validated, that’s obvious, but what about it being the latest revision ever released in the requested channel?

They would refresh to the latest in the current installed epoch, v6.5 in this case.

@wgrant had suggested possibly having multiple fields. He can chime in but I think the idea was something like:

epoch: 2
reads_prior_epochs: 1 (this is the 2* case)
writes_prior_epochs: 1 (this is additionally 2** case)

this allows supporting upgrading from more than just N-1, for example:

reads_prior_epochs: 0,1

so you could jump straight from 6.x to 8.x or whatever.

This would allow a situation, for example, where the developer migrates format and updates epoch a couple of times in edge/beta, and when it reaches stable it really would prefer for the clients to jump from epoch N to epoch N+3 right away.

Other question: will we allow the developer to release a revision using an epoch that is not sequential to the previous one?

Let’s say that currently she’s using epoch 0 (the default now), and when starting to use this feature, the developer would prefer to start with epoch 7 that matches her “internal config format 7”.

So, would we allow her to release a revision using epoch 7, or will we forbid that, so that the epoch bump is forced to go to 1 (starred or not)?

I created this doc making explicit a lot of cases in several scenarios, please comment on anything that is wrong so we can discuss specific cases forward.

Thanks!

Not the latest revision. It goes to the most recently released revision in that channel that it can refresh to.

It was the suggested syntax, but after thinking about it over the weekend, @wgrant has a good point in wanting to allow multiple compatible epochs for reading. I can see that being essential in large scale deployments of databases and similar.

I’m still trying to come up with the appropriate syntax that would support this without being too confusing. In particular, we need to avoid any ambiguity in terms of what the current epoch is, because that establishes a line of updates that must be clear. I’ll come back with a more concrete proposal here.

As I described, I wouldn’t worry too much about the use case for this flag, as indeed it will require quite specific coding in most cases. My main concern with that feature is ensuring that the intended behavior is clear.

That doesn’t solve the actual problem we’re trying to solve, which is ensuring that every single refresh has a chance of reverting. The careful logic has to be in the future, because we can’t prevent past revisions from attempting to reach it. In other words, releasing 0** doesn’t change the fact you have 0 around still.

No, b is not okay. After we go to 7.8 with e1, the double-star behavior is gone because we stepped into a revision that wasn’t trying to preserve compatibility with the prior epoch, so we can’t go back anymore. It can go back from 7.8 to 7.5 because 7.5 is e1, but it cannot go back further.

We should never refresh to a prior epoch automatically, so the only way to reach that scenario is because a refresh failed and reverted automatically, or due to a manual revert. In both cases the revision being reverted to is explicit and local. In every other case, after the client gets epoch e1, it won’t automatically request anything other than e1 or larger.

Here is the content from the spreadsheet inline for better tracking and communication around it.

Still need to go over it.

Scenario 1

Simple and normal refresh sequence: all revisions released, no gating, all in the same channel as the one being refreshed.

Revision  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Epoch     0   0   0   1*  1*  1*  1   1   1   2   2   2*  2   2*  2   3   3

Cases:

Case  From  To  Summary
1.1   3     6   From last revno of epoch 0 to latest 1*
1.2   2     6   From not last revno of epoch, to latest 1*
1.3   4     14  From any 1*, to latest revno of 2*
1.4   7     14  From any 1, to latest revno of 2*
1.5   9     14  From last revno of epoch 1, to latest 2*
1.6   10    15  From any 2, to latest revno of 2 (as there is no 3*)
1.7   12    15  From any 2*, to latest revno of 2 (as there is no 3*)
1.8   15    15  No update if there are no starred revisions in next epoch
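
The selection rule these cases imply can be written down as a minimal sketch. This is only an illustration of the star syntax as discussed so far, not the store's implementation: the names `Rev`, `parse_epochs` and `next_refresh` are made up here, and the list order is assumed to be the release order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rev:
    number: int
    epoch: int
    starred: bool = False  # True for an "N*" revision


def parse_epochs(spec: str) -> List[Rev]:
    """Turn '0 0 1* 1' into revisions numbered from 1, in release order."""
    return [
        Rev(number, int(token.rstrip("*")), token.endswith("*"))
        for number, token in enumerate(spec.split(), start=1)
    ]


def next_refresh(released: List[Rev], installed: Rev) -> Rev:
    """Pick the target of a single refresh step for `installed`."""
    # Prefer the most recently released starred revision of the next epoch.
    for rev in reversed(released):
        if rev.epoch == installed.epoch + 1 and rev.starred:
            return rev
    # Otherwise stay in the current epoch and take its most recent release.
    for rev in reversed(released):
        if rev.epoch == installed.epoch:
            return rev
    return installed


SCENARIO_1 = parse_epochs("0 0 0 1* 1* 1* 1 1 1 2 2 2* 2 2* 2 3 3")


def refresh_from(number: int) -> int:
    """Revision number offered to a client currently on `number`."""
    return next_refresh(SCENARIO_1, SCENARIO_1[number - 1]).number
```

With the Scenario 1 data, `refresh_from(3)` gives 6, `refresh_from(9)` gives 14 and `refresh_from(15)` gives 15, matching cases 1.1, 1.5 and 1.8.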

Scenario 2

The snap is gated, and not all revisions are validated.

Revision   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Epoch      0   0   0   1*  1*  1*  1   1   1   2*  2*  2   2   3*  3*  3   3
Validated  Y   Y   Y   NO  NO  NO  Y   Y   NO  Y   Y   Y   Y   Y   NO  Y   Y

Cases:

Case  From  To  Summary
2.1   3     3   No update, all starred revisions in the next epoch are not validated
2.2   8     11  Jump to latest validated starred in the next epoch (simple case where all are ok)
2.3   13    14  Jump to latest validated starred in the next epoch (case when the latest is not validated)

Scenario 3

Epochs and revisions are “mixed”.

Revision          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Epoch             0   0   0   1*  0   1   2   1*  1   2*  1   1   2   3*  3*  2   3
Release sequence  1   2   3   4   5   6   7   8   9   11  10  15  12  16  13  14  17

Cases:

Case  From  To  Summary
3.1   3     8   To latest 1* (no matter what’s in the “middle”)
3.2   11    10  To latest 2* (no matter if it’s a lower revision, there is no order by revision really)
3.3   16    14  To latest released 3* (see that revision 14 was released after revision 15)
3.4   12    12  No update to another epoch, as r10 (2*) is released before r12, and newer releases are too far away

Scenario 4

Refreshes to some specific revisions (not latest by default).

Revision  1    2    3    4    5    6    7    8    9    10   11   12   13
Epoch     0    0    0    1*   0    1**  1*   1    1    2*   1    2**  2

Cases:

Case  From  Req  Got  Summary
4.1   3     4    4    Ok, requested revno is in epoch N+1 starred
4.2   3     8    ERR  Rejected, requested revno is not starred
4.3   3     6    6    Ok, double-starred revisions “include” the simple-star semantics
4.4   3     10   ERR  Rejected, can not jump two epochs
4.5   7     5    ERR  Rejected, can not “jump epochs backwards” if not double-starred revision
4.6   8     5    ERR  Rejected, can not “jump epochs backwards” if not double-starred revision
4.7   6     8    6    A normal forward refresh (even if not to latest), no matter it’s starting from a **
4.8   6     13   ERR  Next epoch revision must be starred; starting from a ** doesn’t change that
4.9   6     10   10   A normal forward refresh, changing epoch ok, no matter it’s starting from a **
4.10  6     8    8    A normal forward refresh (even if not to latest), no matter starting from a **
4.11  6     5    5    Jump epoch backwards from a double starred rev, landing in a simple one
4.12  12    7    7    Jump epoch backwards from a double starred rev, landing in a starred one
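
The accept/reject decisions in this table follow three checks, which could be sketched as below. This is an illustrative reading of the table only: `parse` and `can_refresh` are hypothetical names, and the check is stateless, so it only looks at the source and target revisions and ignores any history of earlier refreshes.

```python
from typing import Dict, Tuple


def parse(spec: str) -> Dict[int, Tuple[int, int]]:
    """Map revision number -> (epoch, number of stars) from a row of epochs."""
    table = {}
    for number, token in enumerate(spec.split(), start=1):
        table[number] = (int(token.rstrip("*")), token.count("*"))
    return table


SCENARIO_4 = parse("0 0 0 1* 0 1** 1* 1 1 2* 1 2** 2")


def can_refresh(src: int, dst: int, table: Dict[int, Tuple[int, int]] = SCENARIO_4) -> bool:
    """Decide whether an explicit request to go from `src` to `dst` is allowed."""
    src_epoch, src_stars = table[src]
    dst_epoch, dst_stars = table[dst]
    if dst_epoch == src_epoch:
        return True  # same epoch: always fine
    if dst_epoch == src_epoch + 1:
        return dst_stars >= 1  # one epoch forward: target must be starred
    if dst_epoch == src_epoch - 1:
        return src_stars == 2  # one epoch back: source must be double-starred
    return False  # more than one epoch away in either direction
```

For example, `can_refresh(3, 4)` and `can_refresh(6, 5)` are accepted, while `can_refresh(3, 8)` and `can_refresh(3, 10)` are rejected.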

Scenario 5

Refreshes to some specific channel (not in the same one by default).

Revision  1       2       3       4       5       6       7       8       9
Epoch     0       0       0       1*      0       1**     1*      1       2
Channel   stable  stable  beta    beta    edge    beta    beta    edge    cand

Cases:

Case  From  Channel    Got  Summary
5.1   3     stable     2    Latest revision in stable channel
5.2   2     beta       7    Latest revision in beta, ok because it’s epoch+1 and starred
5.3   4     stable     ERR  Rejected, as it would imply going backwards in epoch not starting from a ** rev
5.4   6     stable     2    Latest revision in stable, ok because it’s epoch-1 starting from **
5.5   3     edge       5    Latest revision in edge (can’t jump epoch because no edge with 1*)
5.6   3     candidate  ERR  Rejected, as it would imply a jump to a N+2 epoch
5.7   8     candidate  ERR  Rejected, as it would imply a jump to another epoch, to a not-starred revision

Thanks for putting these in place, @facundobatista. Very useful for the conversation.

Here we go:

Scenario 1 – Looks good!

Scenario 2 – Looks good!

Scenario 3 – The 3.4 case looks bogus. The fact r12 (e1, #15) is released late just means it becomes the most recent snap on epoch 1. If this snap is installed it should still go to r10 (e2*, #11) and then to r14 (e3*, #16), and finally r17 (e3, #17). The underlying logic is that epochs never go backwards. After we release e3, that’s the most recent epoch and the one that will remain at the tip until something more recent comes up. That said, we do allow older epochs to be released, and they become the tip of their own epoch (most recent e1 in the example). In other words, the epoch order (0, 1*, 1, 2*, 2, 3*, 3) is preserved no matter what the release order is. If we didn’t do that, we’d be completely unable to ever release a new 1* after anything higher was released, as it would completely break the sequence.
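
To make the corrected path for case 3.4 concrete, here is a small sketch that walks the Scenario 3 data one refresh step at a time. It is an illustration only: `REVS`, `step` and `refresh_path` are names invented here, and "most recent" is taken to mean the highest release sequence number, never the revision number.

```python
from typing import Dict, List, Tuple

EPOCHS = "0 0 0 1* 0 1 2 1* 1 2* 1 1 2 3* 3* 2 3".split()
SEQUENCE = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 10, 15, 12, 16, 13, 14, 17]

# revision number -> (epoch, starred, release sequence)
REVS: Dict[int, Tuple[int, bool, int]] = {
    number: (int(epoch.rstrip("*")), epoch.endswith("*"), seq)
    for number, (epoch, seq) in enumerate(zip(EPOCHS, SEQUENCE), start=1)
}


def step(installed: int) -> int:
    """One refresh: latest starred revision of the next epoch, else latest of this epoch."""
    epoch = REVS[installed][0]
    starred_next = [n for n, (e, star, _) in REVS.items() if e == epoch + 1 and star]
    same_epoch = [n for n, (e, _, _) in REVS.items() if e == epoch]
    pool = starred_next or same_epoch
    # Order by release sequence; revision numbers are only identifiers.
    return max(pool, key=lambda n: REVS[n][2])


def refresh_path(installed: int) -> List[int]:
    """Repeat refresh steps until nothing newer is offered."""
    path = [installed]
    while True:
        target = step(path[-1])
        if target == path[-1]:
            return path
        path.append(target)
```

Starting from r12, `refresh_path(12)` returns `[12, 10, 14, 17]`, which is the r10, r14, r17 sequence described above.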

Scenario 4 – Case 4.7 is wrong, but that’s probably just a typo in the table. It asks for 8 and should get 8. In fact, case 4.10 is just a duplicate of 4.7 with the result corrected. The remaining cases look right, but there’s one more tricky case that needs to be accounted for and is not in the table: if updating from 6 to 8 to 6 to 5, the first two operations should work, and the last must fail because it went “through” a non-double-starred revision, which means the compatibility promises may have been broken and the backwards compatibility is gone. This logic is mostly in snapd itself rather than the store.
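
One way to picture the 6, 8, 6, 5 case is a per-device "floor" that rises as soon as a non-double-starred revision of a newer epoch has been installed. The sketch below is an assumption about how such tracking could look, not a description of snapd; `InstallHistory` and its fields are hypothetical, and it only models the backwards rule, not the forward star checks.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstallHistory:
    """Tracks the lowest epoch this device may still go back to."""

    floor: Optional[int] = None
    installed: List[int] = field(default_factory=list)

    def install(self, revision: int, epoch: int, double_starred: bool = False) -> bool:
        """Record an install; return False if it would go below the floor."""
        if self.floor is not None and epoch < self.floor:
            return False
        # A double-starred revision keeps the previous epoch reachable;
        # any other revision only guarantees its own epoch.
        keeps = epoch - 1 if double_starred and epoch > 0 else epoch
        # The floor never goes down: once compatibility was given up, it stays gone.
        self.floor = keeps if self.floor is None else max(self.floor, keeps)
        self.installed.append(revision)
        return True
```

With the Scenario 4 epochs, installing r6 (1**), r8 (1) and r6 again succeeds, and the final attempt to install r5 (epoch 0) is refused. Going from r6 straight to r5 is still accepted.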

Scenario 5 – 5.4 is bogus because a double-starred epoch allows the revert to happen, but it doesn’t change the fact epoch 0 is lower than epoch 1**. In this sense, a double-starred revision works exactly the same as if this was a single-starred 1* epoch, and thus incompatible with epoch 0. The rule of thumb is that we never go backwards to previous epochs automatically. Update: Your suggestion actually sounds better. Please see details below.

Rest looks good.

One note about sequencing and revisions. While the separation of sequence vs. revision in Scenario 3 is useful to explicitly acknowledge the fact that what matters is the sequence of releases rather than the number of the revision, it also made the scenario pretty hard to track, and other scenarios actually imply that the revision is sequentially released. I’d suggest thinking about revisions as sequentially released just for facilitating the conversations around it, but explicitly mentioning that revision numbers are just identifiers and not used for ordering. All of these scenarios might have the revision numbers randomized and they’d still be valid.


I’ll just toss around some more terminology options:

upgrade_from_epoch: 0,1

or

supports_prior_epochs: 0,1

@facundobatista After sleeping on these points, I would actually like to retract one of the points I made above, as I think your suggestion was better than what we (or at least I) had in mind. Specifically, in case 5.4 the tip of the channel with the previous epoch should be installed. The alternative behavior I suggested, although seemingly more consistent and safer, is not helpful in that it will prevent people who experiment in other channels from going back to a different line of refreshes in the original channel that was holding a lower epoch.

So, restating more clearly: if the channel tip has a revision that might be installed, it should be installed even if the epoch is lower.

Note that this does not change the point made about case 3.4 above.


My two cents on naming:

forward_epochs
backward_epochs


Okay, so @wgrant’s idea of expanding the language to more clearly define which epochs are supported is sane. Per previous note, I was trying to simplify the problem and encourage people to always go through a well defined sequence of revisions, but while that’s more stable overall in that we force the one tested path, it will also be a relevant burden in some cases.

So I suggest we open up the language on both ends, both in terms of backwards compatibility and forwards compatibility, and allow defining clearly which epochs are supported. The behavior remains the same one we’re discussing above. We just extend the logic so that multiple epochs may be defined on either end, and when looking for refreshes it picks the most recent epoch supported.

That said, I’d like to keep the proposed syntax as well, as a shorthand notation, because it is easier to understand for the simple cases which are also the most common. Also, when people don’t know or don’t care to check, using the simple syntax means it will do the right thing and force people to go through the tested path, so we still get the benefit described above of encouraging the stable path.

Here is the proposed syntax with examples of some expressions and what those expressions would expand to.

Example  Expression             Expanded
A        (missing)              epoch:
                                    read: [0]
                                    write: 0
B        epoch: 1               epoch:
                                    read: [1]
                                    write: 1
C        epoch: 2*              epoch:
                                    read: [1, 2]
                                    write: 2
D        epoch:                 epoch:
             write: 2               read: [2]
                                    write: 2
E        epoch:                 epoch:
             read: [1, 2]           read: [1, 2]
                                    write: 2
F        epoch:                 epoch:
             read: [1, 2, 3]        read: [1, 2, 3]
             write: 1               write: 1

The proposed rules are:

  1. If the write epoch of the snap currently installed is present in the read epoch list of the candidate snap, snapd will accept the refresh.
  2. The store always offers the most recent snap able to read the highest epoch among those that may be installed (see rule 1 and 6).
  3. The default write epoch, when one is not provided explicitly, is the highest one in the read epoch list.
  4. The default read epoch list, when one is not provided explicitly, is a list containing only the write epoch.
  5. For clarity, the read field must be ordered when provided explicitly in the yaml file, which means the highest epoch is always the last one in the list.
  6. When a snap is pushed to the store it becomes the most recent and best candidate on all epoch reads it supports (addresses point discussed on case 3.4 above).
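
As a rough sketch of how the shorthand expansion and rules 1, 3, 4 and 5 fit together (an illustration under assumptions, not snapd's parser): `expand_epoch` and `accepts_refresh` are hypothetical names, the input is assumed to be the already-parsed YAML value, and rejecting `0*` is a guess, since the proposal does not say what it should mean.

```python
from typing import Any, Dict


def expand_epoch(expr: Any = None) -> Dict[str, Any]:
    """Expand a shorthand epoch value into explicit read/write fields."""
    if expr is None:
        return {"read": [0], "write": 0}  # missing epoch means epoch 0
    if isinstance(expr, dict):
        read, write = expr.get("read"), expr.get("write")
        if read is None and write is None:
            raise ValueError("epoch needs a read list or a write value")
        if read is None:
            read = [write]  # rule 4: read defaults to just the write epoch
        if write is None:
            write = max(read)  # rule 3: write defaults to the highest read epoch
        if list(read) != sorted(read):
            raise ValueError("read list must be ordered")  # rule 5
        return {"read": list(read), "write": write}
    text = str(expr).strip()
    if text.endswith("**"):
        raise ValueError("double-star syntax is not supported")
    if text.endswith("*"):
        n = int(text[:-1])
        if n == 0:
            raise ValueError("0* has no prior epoch (assumed invalid here)")
        return {"read": [n - 1, n], "write": n}
    n = int(text)
    return {"read": [n], "write": n}


def accepts_refresh(installed: Dict[str, Any], candidate: Dict[str, Any]) -> bool:
    """Rule 1: the candidate must be able to read what the installed snap writes."""
    return installed["write"] in candidate["read"]
```

For instance, `expand_epoch("2*")` yields `{"read": [1, 2], "write": 2}`, so a snap on `epoch: 1` can refresh to it while one without any epoch (epoch 0) cannot.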

As previously stated, per those rules both the original syntax and the original single-star semantics discussed above are still fully supported, but they generalize the semantics so that compatibility across multiple epochs is supported both forwards (read list with multiple entries) and backwards (write epoch lower than previous one).

We’re dropping the double-star syntax, though, while preserving its semantics via the expanded syntax.

For complete clarity, this is the translation from the obsolete and unsupported syntax:

Expression                 Expanded
# Syntax not supported!    epoch:
# epoch: 2**                   read: [1, 2]
                               write: 1

Update: This example was fixed.

How does that sound?

So, the epoch analysis comes before the revision release analysis; in other words, on a refresh we first need to consider whether there’s an epoch upgrade available, and only after that which revision to pick (inside the current or the new epoch) based on release history and the other details.

Yes, it was a typo, good catch! Thanks

This new way to express epochs is undoubtedly more flexible.

It raises more questions and doubts, let’s see:

  • we lose the “epoch number” as the sequence that needs to be followed… in the old way of thinking about this, you were in epoch N, and you could stay in N or jump to N+1 (if starred, etc)… now it’s a matter of “capabilities”: which epochs the revision can read or write

  • because of what I just said, the rule of “not jumping more than one epoch”, for example, vanishes… if the client is on a revision which writes “N”, it could jump to any other revision that has N among the epochs it reads, no matter if it writes N+1, N+2, etc

  • also, we lose the “epoch ordering”, which raises this doubt: if in a refresh we find several revisions to offer to the client, which one do we select? the one with the highest number in the “writes” attribute, or just the latest released one in the channel? As per your rule 2 above, we should even be ordering by the highest number in the “reads” attribute… is that right?

In general, what bothers me is that we would need to stop thinking of epochs as a “sequence of stages” (which is how we always thought of them, and the base concept in your video), and start thinking of them as “matching capabilities” which allow us to filter the existing revisions to answer from the Store in a refresh.

Is this concept change correct? A price to pay in complexity, buying the new flexibility?

If this is the path to pursue, I’ll rethink all the cases in the spreadsheet to ensure we have everything covered.

@facundobatista Creating more scenarios sounds good, but let’s please not touch the existing ones other than fixing the issues discussed above. The rules really haven’t changed, and the existing scenarios are all still covered and still work the same way, other than the syntax for double-stars being gone.

These points you raise are all covered in the few simple rules mentioned right below the examples, and mainly rule 1 and 2:

So while before the conversation was:

  • <snapd> I have epoch 2… what do you have for me on stable?
  • <store> Here is the most recent snap with epoch 3*.

Now it will be:

  • <snapd> I have epoch write 2… what do you have for me on stable?
  • <store> Here is the most recent snap with epoch read 4, which is the highest that can still read 2.

I’ll also add rule 6 below the existing ones to be more explicit about the issue we discussed around case 3.4 above. It will read:

When a snap is released to a channel it becomes the most recent and best candidate on all epoch reads it supports within that channel.

Actually, maybe we should tweak those rules slightly to make the behavior more useful. Here is the problem: with the described rules, if a snap revision 10 with epoch read [2, 3] is pushed after revision 9 with epoch read [2, 3, 4], revision 9 is still a better candidate because it reads epoch 4. This is not a problem in itself, but it means there’s absolutely no point in accepting a release of revision 10 after that revision 9, because it will never be used.

If we tweaked rule 2 to say:

2. The store always offers the most recent snap able to read the epoch write currently installed

The situation above would mean revision 10 becomes the candidate, which seems like a more useful outcome. If that’s not desired, then just don’t push revision 10 with that epoch.

This does change slightly the original semantics I presented above for case 3.4, though, because something declaring simply epoch: 2 and released later would take precedence over something declaring epoch: 3*. Maybe that’s okay, with the rationale that this event is rare and one can always release any revision again to adapt the order as desired. This also honors more closely the current semantics of the developer having more control over what is the tip that should be the candidate.

Hmmmm…

Okay, after pondering for a while about these points made right above, I think we can scratch that as the original rules defined before are still more sensible. The reason is that the case of having an epoch read of [2, 3] after publishing a [2, 3, 4] is too artificial. In reality if there’s a version of the software that can handle [2, 3, 4] at tip, the developer should just continue to release that instead of going back to one that cannot handle epoch 4 anymore.

The more realistic case of going back and forth is for example when there’s a snap revision 9 handling epoch [2, 3], and then revision 10 handling [3, 4]. In that case it is indeed useful to be able to release revision 11 that can handle [2, 3] again, to fix a serious problem on that old version of the software that was a stepping stone to reach epoch 4 (say, a security fix or a problem in the upgrade routine itself). For that to work without creating a mess, though, we need to maintain the original rules defined in the proposal. Otherwise releasing revision 11 would destroy the correct ordering of the updates.
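
The two examples above can be checked against both variants of rule 2 with a short sketch. This is one reading of the rules, not the store's code: `Snap`, `offer_original` and `offer_tweaked` are hypothetical names, the list order stands in for the release order, and write epochs not stated above are filled in using the default from rule 3.

```python
from typing import List, NamedTuple, Optional


class Snap(NamedTuple):
    revision: int
    read: List[int]
    write: int


def offer_original(released: List[Snap], installed_write: int) -> Optional[Snap]:
    """Rule 2 as proposed: highest readable epoch first, then most recent."""
    usable = [s for s in released if installed_write in s.read]  # rule 1
    if not usable:
        return None
    top = max(max(s.read) for s in usable)
    # `released` is in release order, so the last match is the most recent.
    return [s for s in usable if max(s.read) == top][-1]


def offer_tweaked(released: List[Snap], installed_write: int) -> Optional[Snap]:
    """Tweaked rule 2: simply the most recent snap that can read the installed write."""
    usable = [s for s in released if installed_write in s.read]
    return usable[-1] if usable else None


# Artificial case: [2, 3] pushed after [2, 3, 4].
ARTIFICIAL = [Snap(9, [2, 3, 4], 4), Snap(10, [2, 3], 3)]

# Realistic case: a fix for the [2, 3] stepping stone pushed after [3, 4].
REALISTIC = [Snap(9, [2, 3], 3), Snap(10, [3, 4], 4), Snap(11, [2, 3], 3)]
```

In the realistic case, a client writing epoch 2 is offered revision 11 under either variant. From there (write 3), the original rule offers revision 10 and the client reaches epoch 4, whereas the tweaked rule keeps offering revision 11 and the client never moves on.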

So, the original proposal above still stands.