Proposal: architectures keyword for snap CI systems

Background

Snapcraft currently supports an architectures keyword that one can use to specify the architectures on which the snap runs. However, this has proven to be a confusing feature, for a few reasons:

  • The same snapcraft.yaml file is used to build the snap for multiple architectures, and yet if one uses the architectures keyword, each of these snaps (regardless of the architecture on which it was built) claims it runs on the same set of architectures.
  • It’s not at all uncommon for someone to provide a list of architectures in the snapcraft.yaml using this keyword, and then build the snap on build.snapcraft.io. The expectation is typically that it will build the snap for the architectures “requested”, when in fact that’s not the purpose of this keyword, nor does build.snapcraft.io use it in this manner.

Proposal

The proposal is to rework this keyword to better match user expectations. The architectures keyword will be restructured into a list of more explicit objects, specifying both build and run architecture(s):

architectures:
  - build-on: [<build arch 1>, <build arch 2>]
    run-on: [<run arch 1>, <run arch 2>]

Note that, while both of these can be a list, if the list is a single item they can also be simplified to a scalar (e.g. build-on: amd64). The default value for run-on is the value of build-on.
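The shorthand rules above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not Snapcraft's implementation; the function name normalize_entry is hypothetical, and entries are assumed to have already been parsed from YAML into plain dicts.

```python
def normalize_entry(entry):
    """Expand one architectures entry into explicit build-on/run-on lists.

    Hypothetical helper: `entry` is a dict already parsed from YAML.
    """
    def as_list(value):
        # A scalar such as "amd64" is shorthand for the one-item list ["amd64"].
        return [value] if isinstance(value, str) else list(value)

    build_on = as_list(entry["build-on"])
    # When run-on is omitted, it defaults to the value of build-on.
    run_on = as_list(entry.get("run-on", build_on))
    return {"build-on": build_on, "run-on": run_on}


print(normalize_entry({"build-on": "amd64"}))
print(normalize_entry({"build-on": "i386", "run-on": ["amd64", "i386"]}))
```

The first call shows the default (run-on copied from build-on), the second shows an explicit run-on left untouched.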

This keyword will remain completely optional. By default, Snapcraft will continue to build a snap that claims it runs on the same architecture as the build environment. Also, similar to what it does today, it will support all as a valid run-on architecture to denote a snap that can run everywhere (e.g. a snap that is only shell scripts).

Regarding CI systems (such as build.snapcraft.io), they can use this keyword to determine the architectures to use to build this snap: if none are specified, build all architectures. Otherwise, build only the ones specified.
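As a rough sketch of how a CI system might apply that rule, the Python below returns one list of candidate build architectures per snap to produce. The names plan_builds and SUPPORTED are hypothetical, the list of supported architectures is made up for illustration, and entries are assumed to be in the object form (dicts with a build-on key).

```python
# Hypothetical set of architectures this CI system has builders for.
SUPPORTED = ["amd64", "i386", "armhf", "arm64"]


def plan_builds(architectures, supported=SUPPORTED):
    """Return one list of candidate build architectures per snap to produce."""
    if not architectures:
        # Keyword absent: build on every architecture the CI system supports.
        return [[arch] for arch in supported]
    plans = []
    for entry in architectures:
        build_on = entry["build-on"]
        # One entry means one snap; a list of build-on values means
        # "any one of these builders will do".
        plans.append([build_on] if isinstance(build_on, str) else list(build_on))
    return plans


print(plan_builds(None))
print(plan_builds([{"build-on": "amd64"}, {"build-on": "i386"}]))
```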

Another piece to this puzzle when it comes to CI systems is the concept of a “build set”, defined to be the set of snaps built from the same snapcraft.yaml at the same point in time (the same commit, if we’re talking git). This set of snaps could be managed as a set instead of needing to manage each revision on its own. This proposal assumes that the CI system will fail the entire build set if any one of the builds fail (e.g. given a build set of amd64, i386, and armhf, if the armhf build fails, the entire build set is considered to have failed regardless of whether or not amd64 and i386 succeeded). To that end, one can use build-error: ignore to indicate an experimental/in-progress architecture that should be counted as part of the build set if it succeeds, but not cause the rest of the build set to fail if it fails.
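The build set rule can be illustrated with a small Python sketch. It assumes each entry has a scalar build-on and that the CI system records one pass/fail result per build architecture; build_set_succeeded and the results mapping are hypothetical names, not part of any existing API.

```python
def build_set_succeeded(architectures, results):
    """Decide whether a build set as a whole succeeded.

    `results` maps a build architecture to True (built) or False (failed).
    Assumes every entry uses a scalar build-on.
    """
    for entry in architectures:
        if results[entry["build-on"]]:
            continue
        # A failed build only sinks the set if it is not marked as ignorable.
        if entry.get("build-error") != "ignore":
            return False
    return True


entries = [
    {"build-on": "amd64"},
    {"build-on": "i386"},
    {"build-on": "armhf", "build-error": "ignore"},
]
print(build_set_succeeded(entries, {"amd64": True, "i386": True, "armhf": False}))
print(build_set_succeeded(entries, {"amd64": True, "i386": False, "armhf": True}))
```

In the first call only the experimental armhf build fails, so the set still counts as successful; in the second, i386 fails and the whole set fails.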

Examples

Example 1

architectures:
  - build-on: i386
    run-on: [amd64, i386]

Snapcraft’s interpretation

If running on an i386 host, Snapcraft will build a snap that claims it runs on both amd64 and i386. If running elsewhere, Snapcraft will follow its default behavior, building a snap that runs on the build architecture.
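Snapcraft's side of this can be sketched as a lookup keyed on the host architecture. The Python below is a guess at the logic described in the proposal, not Snapcraft's code; claimed_run_on is a hypothetical name and entries are assumed to be dicts in the object form.

```python
def claimed_run_on(architectures, host_arch):
    """Return the architectures a snap built on host_arch would claim to run on."""
    def as_list(value):
        return [value] if isinstance(value, str) else list(value)

    for entry in architectures or []:
        build_on = as_list(entry["build-on"])
        if host_arch in build_on:
            # Matching entry: use its run-on, defaulting to build-on.
            return as_list(entry.get("run-on", build_on))
    # No entry mentions this host: default behavior, run on the build architecture.
    return [host_arch]


example_1 = [{"build-on": "i386", "run-on": ["amd64", "i386"]}]
print(claimed_run_on(example_1, "i386"))
print(claimed_run_on(example_1, "armhf"))
```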

CI systems’ interpretation

As there is a single non-scalar object in this list, CI systems know to produce only a single snap. Checking the build-on key, they know that it needs to be built on i386.

Example 2

architectures:
  - build-on: amd64
    run-on: all

Snapcraft’s interpretation

If running on an amd64 host, Snapcraft will build a snap that claims it can run on all architectures. If running elsewhere, Snapcraft will follow its default behavior, building a snap that runs on the build architecture.

CI systems’ interpretation

CI systems can assume that the user only wants the snap built on amd64.

Example 3

architectures:
  - build-on: amd64
    run-on: amd64

  - build-on: i386
    run-on: i386

Which is the same as:

architectures:
  - build-on: amd64
  - build-on: i386

Snapcraft’s interpretation

As far as Snapcraft is concerned, this is no different from its default behavior.

CI systems’ interpretation

CI systems can assume that the user only wants the snap built on amd64 and i386, and the resulting snaps are to be considered a build set (e.g. if amd64 succeeds but i386 fails, the entire set should be considered to have failed).

Example 4

architectures:
  - build-on: amd64
    run-on: amd64

  - build-on: i386
    run-on: i386

  - build-on: armhf
    run-on: armhf
    build-error: ignore

Snapcraft’s interpretation

Again, as far as Snapcraft is concerned, this is no different from its default behavior.

CI systems’ interpretation

CI systems can assume that the user only wants the snap built on amd64, i386, and armhf. While the resulting snaps are considered a build set, armhf may fail. If it does, release the rest of the build set as normal (i.e. don’t fail the entire build set if armhf fails). If amd64 or i386 fails, however, still consider the entire build set to have failed.

Example 5

architectures:
  - build-on: [amd64, i386]
    run-on: all

Snapcraft’s interpretation

If building on amd64 or i386, Snapcraft will produce a snap that claims it runs on all architectures. If running elsewhere, Snapcraft will follow its default behavior, building a snap that runs on the build architecture.

CI systems’ interpretation

There is only a single non-scalar item in architectures, so CI systems know there is only a single snap to be produced from this, and the resulting snap will claim it runs on all architectures. However, the snap author has specified that either amd64 or i386 could be used to produce this snap, which leaves it up to the CI system to decide which architecture to use. Which one has a smaller build queue?
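One way a CI system could make that choice is to pick the candidate with the shortest build queue. The sketch below assumes the CI system can report a queue length per architecture; pick_builder and the queue numbers are hypothetical and only illustrate the idea.

```python
def pick_builder(build_on, queue_lengths):
    """Pick the build architecture whose builder queue is currently shortest."""
    return min(build_on, key=lambda arch: queue_lengths[arch])


# Made-up queue lengths for illustration only.
queues = {"amd64": 12, "i386": 3}
print(pick_builder(["amd64", "i386"], queues))
```

Any other policy (first builder to respond, a fixed preference order) would fit the proposal equally well, since the snap author has declared the candidates interchangeable.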

Example 6

architectures: [amd64, i386]

Which is the same as:

architectures:
  - build-on: [amd64, i386]

Which is the same as:

architectures:
  - build-on: [amd64, i386]
    run-on: [amd64, i386]

Snapcraft’s interpretation

If building on amd64 or i386, Snapcraft will produce a snap that claims it runs on both amd64 and i386. If running elsewhere, Snapcraft will follow its default behavior, building a snap that runs on the build architecture. Note that this is a different interpretation of the currently-supported syntax.

CI systems’ interpretation

There is only a single non-scalar item in architectures, so CI systems know there is only a single snap to be produced from this, and the resulting snap will claim it runs on both amd64 and i386. However, the snap author has specified that either amd64 or i386 could be used to produce this snap, which leaves it up to the CI system to decide which architecture to use. Which one has a smaller build queue?

Example 7

architectures:
  - build-on: amd64
    run-on: all

  - build-on: i386
    run-on: i386

Snapcraft’s interpretation

Technically Snapcraft could work with this, and treat it similarly to Example 5. However, in this proposal it is an error, mostly to inform the user because of the CI systems’ interpretation of this.

CI systems’ interpretation

There are two non-scalar items in architectures, which implies that two snaps will be built. However, one of the snaps to be produced would claim it runs on i386, while the other would claim it runs everywhere (including i386). That means they would both be released to i386, which is likely not what the developer intended (since the user will only receive the latest). This is an error case.
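A validation pass that rejects this case might look like the Python sketch below. It assumes two rules: no architecture may appear in more than one entry's run-on, and all is only allowed when there is a single entry. The function name validate_entries is hypothetical, and entries are assumed to be dicts in the object form.

```python
def validate_entries(architectures):
    """Raise ValueError if two snaps would be released to the same architecture."""
    seen = set()
    for entry in architectures:
        run_on = entry.get("run-on", entry["build-on"])
        run_on = [run_on] if isinstance(run_on, str) else list(run_on)
        # "all" overlaps with every other architecture, so it must stand alone.
        if "all" in run_on and len(architectures) > 1:
            raise ValueError("run-on: all is only valid with a single entry")
        for arch in run_on:
            if arch in seen:
                raise ValueError(f"{arch} appears in more than one run-on")
            seen.add(arch)


example_7 = [
    {"build-on": "amd64", "run-on": "all"},
    {"build-on": "i386", "run-on": "i386"},
]
try:
    validate_entries(example_7)
except ValueError as error:
    print("rejected:", error)
```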

@kyrofa This seems to miss any reference to the lengthy conversations we had on the topic, and to what the advantages are in comparison to the final state of those proposals. It also feels pretty confusing at first sight, in the sense that amd64 doesn’t run on all architectures, although that’s exactly what the syntax states.

Let’s please not move forward with these changes before we can synchronize and unify these two conversations.

This somewhat follows conversations from the Capetown sprint of having the pattern of

  • where it can build,
  • where that build can run,
  • where it is acceptable to have a build failure

Following that pattern, we had also discussed cross compilation and that was a matter of leaving it to something done through the CLI.

All that said, I do think this language does convey that quite nicely given the requirements and constraints stated above from that conversation in Capetown.

So for

it can build on amd64 and that snap that was built can run on all architectures

and for

it needs to build on i386 and outputs a snap that can run on amd64 and i386

While

indicates it is ok to fail on that architecture.

My point is that it ignores the conversation without any rationale for why. Even if we don’t end up with the syntax proposed there, the conversation was extensive and fruitful, and it’d be wise not to ignore it.

What you read diverges from what’s written. That very literally says “architecture amd64 runs on all”.

Let’s please meet to discuss these ideas in a place with more bandwidth.

@niemeyer and I had a nice chat about this a few minutes ago, and agreed to tweak the syntax proposed here to turn it into a more exact/readable list of objects instead of a single object with multiple keys. I’ve updated the proposal to reflect this (remember you can hit the pencil icon to see the old proposal, as well as the diff).

Thanks for writing the proposal down, Kyle.

Looking at the repetition in some of these cases got me thinking about whether we should further specify that an entry such as:

architectures:
    - amd64

is exactly equivalent to:

architectures:
    - build-on: amd64
      run-on: amd64

With that, we blend the two syntaxes and turn one into a special case of the other. This would also support mixing, so that we can have three simple entries, and one special for ignoring errors for example.

That avoids much of the repetition in the samples above.

@niemeyer well, that’s the syntax we support today, but your suggestion changes its meaning. The current meaning of that syntax given in the NEW syntax is this:

# old:
architectures: [amd64]

# new equivalent meaning
architectures:
  - build-on: all
    run-on: [amd64]

Granted, it doesn’t change it much, but it’s enough that it worries me. I don’t want to break people already using this keyword.

It does change its meaning, but this is a larger problem that we cannot avoid, in the sense that either we make it awkward because the two different syntaxes mean something completely different, or we adopt the different syntax in the two cases.

Before making a decision, we might ask the store team to have a quick look and see how many of our snaps are multi-arch today. This will give us an idea of the actual impact.

The store doesn’t have access to the publisher’s intent as expressed in the current syntax (snapcraft.yaml), just the resulting revisions. Taking multi-arch to mean the same snap revision (blob) targeting more than one architecture, we have about 50 snaps (snap_ids) with at least one revision in this scenario.

That list includes snaps that have not been updated in more than a year, some that were never released to stable, and some that were already modified to build architecture-specific revisions. The list can be refined as needed, but at this level (50) manual review might be more effective.

Also note that, due to the lack of a snapcraft.yaml, this list doesn’t include any snaps where the developer uses a single architecture in the architectures field (e.g. architectures: [amd64]).

It may be worth pointing out that changing the meaning of existing syntax will have an impact in other ways, too: existing documentation, any blog posts out there referencing this behavior, etc. Our ISVs tell us that breakage slows down their flow tremendously, so I naturally like to avoid it wherever possible.

I am conscious that we will break things, and we should try not to, but the current behavior is already broken, and making it behave differently already requires us to update documentation. This is also a rather obscure feature.

With all that said, I do agree with @niemeyer. We could compromise by using a new keyword and giving up the nice architectures name to a feature we have long wanted to kill (which, if we used a new keyword, we would only deprecate with warnings), but given how obscure this is, is it really worth it?

Alright, I am also conscious of the significant conversation that has already happened around this feature. I’m okay with being outvoted! I’ll get started on the change.

So are we good to move forward with the implementation? @niemeyer I believe you made some final comments on IRC which I have missed.

I’m still trying to avoid the syntax issue mentioned above without introducing the ambiguity.

How about this minor tweak in the proposal we previously discussed that was detailed by @kyrofa above:

We change the default value when run-on: is missing so that it matches build-on:. In other words, this:

architectures:
    - build-on: [i386, amd64]

becomes equivalent to:

architectures:
    - build-on: [i386, amd64]
      run-on: [i386, amd64]

which is also equivalent to:

architectures: [i386, amd64]

This may sound pointless since that latter form is already supported, but the point of doing that is actually having this:

architectures:
    - build-on: i386
    - build-on: amd64

being equivalent to:

architectures:
    - build-on: i386
      run-on: i386
    - build-on: amd64
      run-on: amd64

That default for run-on is more useful than having all as the default: given the rule that an architecture may appear in at most one of the run-on statements, by definition we can only have all when there is a single architecture entry for the snap.

How does that sound?

This does flow in a natural way in the sense that it feels intuitive to think that the snap built could run where we say it can build.

To be on the same page wrt run-on: all, does this mean that,

architectures:
    - build-on: amd64

or

architectures:
    - build-on: [amd64, i386]

have an implicit run-on: amd64 (or run-on: [amd64, i386]) and that if it does need to run everywhere, run-on: all should be specified explicitly?

Right, the implicit is always to just copy it over, so it’s very intuitive. In that example, this would be equivalent to:

architectures:
    - build-on: [amd64, i386]
      run-on: [amd64, i386]

Wouldn’t this cause the builder to spin up an amd64 machine AND an i386 machine, with BOTH of those machines producing a package that claims it runs on both amd64 and i386? That doesn’t sound desirable, because it will produce two supposedly identical packages that were built differently…

I think the default should be to run-on only the architecture of the builder when nothing is specified.

So it would look like:

architectures:
- build-on: [amd64, i386]

# converts into equivalent of:
architectures:
- build-on: amd64
  run-on: amd64
- build-on: i386
  run-on: i386

These snippets are both valid, but they convey different intentions, so we cannot map one into the other. The first says “build on either amd64 or i386 something that can install on both amd64 and i386”, while the second says “build on amd64 something that can run on amd64” and “build on i386 something that can run on i386”.

A good quality of that syntax is that I pretty much read what was written down.

So you’re saying that build-on: [amd64, i386] will only build one package using whichever builder is first to respond of one of those architectures? i.e. it will not build on both architectures for each commit…

That is correct, this is a variation of Example 5 in the original proposal.

@kyrofa mind updating the proposal with the latest agreements?