Keeping up with SPDX license identifier updates

Hello folks!

As you all may know, the snap ecosystem uses SPDX license identifiers to represent a snap’s licensing information in a concise and machine-readable way, which also allows for validation of SPDX expressions.

The snap store and snapd each maintain a list of SPDX license identifiers (see https://github.com/snapcore/snapd/commits/master/spdx/licenses.go for snapd's; the store has a similar one). I believe that list was snapshotted at a time when SPDX 2.x was current. One of the reasons for keeping an internal list is that SPDX did not provide a machine-readable list of identifiers back then.

This changed with SPDX 3.0, which came out in 2018. The good news is that they now provide that list as JSON, with the identifiers and some other properties of each license.

The bad news is that SPDX 3.0 also introduced new license identifiers, which is how we learned about it, when someone complained the store wasn’t accepting what they thought was a perfectly valid identifier (FWIW snapd would also have rejected those identifiers).

Since SPDX now provides their identifier list in JSON format, and (AFAIU) they also guarantee existing identifiers won’t change or be removed (they can be marked as deprecated though), the obvious fix for the store bug above was to pull the JSON identifier list periodically, so the store would stay up to date and accept the latest identifiers.
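To make the "pull the list" idea concrete, here is a minimal Go sketch of reading identifiers out of a document shaped like SPDX's licenses.json. The JSON field names (`licenseListVersion`, `licenses`, `licenseId`, `name`, `isDeprecatedLicenseId`) follow the published file; the helper name `parseLicenseIDs` and the tiny `sample` excerpt are made up for illustration and are not existing store or snapd code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// licenseList mirrors only the fields of licenses.json that matter here.
type licenseList struct {
	Version  string `json:"licenseListVersion"`
	Licenses []struct {
		ID         string `json:"licenseId"`
		Name       string `json:"name"`
		Deprecated bool   `json:"isDeprecatedLicenseId"`
	} `json:"licenses"`
}

// parseLicenseIDs (hypothetical helper) returns the list version and a map
// from each identifier to whether it is marked as deprecated.
func parseLicenseIDs(data []byte) (string, map[string]bool, error) {
	var list licenseList
	if err := json.Unmarshal(data, &list); err != nil {
		return "", nil, fmt.Errorf("cannot parse license list: %w", err)
	}
	ids := make(map[string]bool, len(list.Licenses))
	for _, l := range list.Licenses {
		ids[l.ID] = l.Deprecated
	}
	return list.Version, ids, nil
}

// sample is a hand-written excerpt in the same shape as licenses.json,
// used here instead of a network fetch.
const sample = `{
  "licenseListVersion": "3.0",
  "licenses": [
    {"licenseId": "MIT", "name": "MIT License", "isDeprecatedLicenseId": false},
    {"licenseId": "GPL-3.0", "name": "GNU General Public License v3.0 only", "isDeprecatedLicenseId": true},
    {"licenseId": "GPL-3.0-only", "name": "GNU General Public License v3.0 only", "isDeprecatedLicenseId": false}
  ]
}`

func main() {
	version, ids, err := parseLicenseIDs([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("license list version:", version)
	for id, deprecated := range ids {
		fmt.Printf("%s (deprecated: %v)\n", id, deprecated)
	}
}
```

Deprecated identifiers are kept in the map rather than dropped, since (AFAIU) they remain valid and a validator still has to accept them.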

However, a change like this must be considered in the context of the entire snap ecosystem and toolset. The store can’t simply start accepting and publishing snaps with license IDs that snapd does not recognize, because then people will be unable to install them. Just for absolute clarity: I have NOT made any such change in the store, and won’t until we’re all clear and agreed on what to do.

In the past we discussed using snapd as the central SPDX validation engine, so both snapcraft and the store would call a hypothetical “snap validate-license-expression” command to ensure uniformity. On its own, this solution would not help with constantly-updating SPDX identifiers. The store’s snapd copy would still need to be updated periodically (which also necessitates constant snapd updates), and users in the wild would still be exposed to their snapd lagging behind the store and receiving an expression it can’t parse.

So the point of this thread is to discuss with the interested parties how best to proceed.

In addition to the store bug I mentioned above, I filed this snapd bug describing the issue, and in there I mention a solution which Bret came up with, which keeps the validation engines separate as they are now, but uses the store as a central repository of license data which snapd can sync to, when needed. To repeat the proposal, which is absolutely a strawman and can be refined, modified or entirely discarded:

  • The store will use the latest version of the SPDX license list from the location noted above. We will update our version on every store rollout (happens several times a week).
  • Since snapds in the wild are not necessarily always in sync and up to date, there is always the possibility snapd will receive from the store a license expression with unknown (read: new) identifiers. Even having the store use snapd as the validation engine would not remove this possibility.
  • So snapd could get/refresh the list of known identifiers from the store. The store will have a verbatim copy of the .json files from spdx.
  • To avoid excess traffic, Bret suggested:
      1. When trying to validate a snap’s license, use the local data.
      2. If an unknown identifier is found in an expression, try fetching the latest data from the store and retry the validation (which should now pass). Cache that latest data to keep the local license list updated.
      3. If the validation still fails, then it is a bogus expression; show the appropriate error.
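The three-step fallback Bret suggested could look roughly like this in Go. This is only a sketch under stated assumptions: `validator`, `fetchFunc` and `unknownIDs` are hypothetical names, the fetch callback stands in for whatever store endpoint would serve the list (none is defined in this thread), and the expression handling is deliberately naive (it only checks identifiers, not expression syntax, and does not check the exception name after WITH).

```go
package main

import (
	"fmt"
	"strings"
)

// fetchFunc stands in for "get the latest identifier list from the store".
// The real transport and endpoint are not specified; this is an assumption.
type fetchFunc func() (map[string]bool, error)

type validator struct {
	known map[string]bool // local (seeded or cached) identifiers; true = known
	fetch fetchFunc       // nil when there is no store access
}

// unknownIDs returns the identifiers in expr that are not in known.
// Simplification: parentheses and AND/OR are skipped, a trailing "+" is
// stripped, and the token following WITH (an exception name) is ignored.
func unknownIDs(expr string, known map[string]bool) []string {
	expr = strings.NewReplacer("(", " ", ")", " ").Replace(expr)
	var missing []string
	skipNext := false
	for _, tok := range strings.Fields(expr) {
		if skipNext {
			skipNext = false
			continue
		}
		switch tok {
		case "AND", "OR":
			continue
		case "WITH":
			skipNext = true
			continue
		}
		if id := strings.TrimSuffix(tok, "+"); !known[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func (v *validator) validate(expr string) error {
	// Step 1: validate against local data only.
	missing := unknownIDs(expr, v.known)
	if len(missing) == 0 {
		return nil
	}
	if v.fetch == nil {
		return fmt.Errorf("unknown license identifiers (no store access): %v", missing)
	}
	// Step 2: fetch the latest data once, cache it, and retry.
	latest, err := v.fetch()
	if err != nil {
		return fmt.Errorf("cannot refresh license list: %w", err)
	}
	v.known = latest
	// Step 3: still unknown after the refresh means a bogus expression.
	if missing = unknownIDs(expr, v.known); len(missing) > 0 {
		return fmt.Errorf("invalid license expression, unknown identifiers: %v", missing)
	}
	return nil
}

func main() {
	v := &validator{
		known: map[string]bool{"MIT": true, "GPL-3.0": true},
		fetch: func() (map[string]bool, error) {
			// Pretend the store knows one more identifier than we do.
			return map[string]bool{"MIT": true, "GPL-3.0": true, "GPL-3.0-only": true}, nil
		},
	}
	for _, expr := range []string{"MIT", "MIT OR GPL-3.0-only", "MIT AND Bogus-1.0"} {
		err := v.validate(expr)
		fmt.Printf("%q: %v\n", expr, err)
	}
}
```

One thing this makes visible: every bogus expression costs a round trip to the store, so a real implementation would probably want to rate-limit refreshes or remember recently rejected identifiers.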

We also need to consider the sideloading case (e.g. snapd could fall back to a cached list when a snap is sideloaded). A problem with sideloads is that if a snap with a newish license expression is installed and snapd has no store access, it will be unable to update its license list. In this case I would suggest just reporting that the license is unknown to this snapd.

In any case, snapd should have an initial, seeded list of licenses which should be updated periodically so the disconnected and infrequently-updated cases don’t fall too far behind.

What do you think? I’m looking forward to working together to come up with a good ecosystem-wide solution to this issue :slight_smile:

- Daniel

PS: I filed this in the snapd category but tell me if it’s better to put it in the store category. Unfortunately cross-category posts don’t seem to be allowed :frowning:


Does snapd really need to be strict on the license meta? Could we not have strict conformance of license meta in the store but at snapd’s end just display whatever it can - if it doesn’t know what the license meta means then display it verbatim, after sanitisation for potential nasties… If it does know then display a richer version such as a human-readable description.


I wonder if the SPDX database should be separated into an independent Go library or something…

While there’s a lot that we’d rather have split out from the snapd codebase, each additional library we need to pull in makes cross-distro packaging of snapd itself harder.


If we started lax and then found we needed to go strict, it’d break for a lot of people. If we start strict and then relax what’s needed, nothing breaks.


Thanks for these ideas. I wonder how often new SPDX identifiers really get added and whether we really need much process around this. My suggestion would be that we simply update the licenses in snapd every once in a while (e.g. via https://github.com/snapcore/snapd/pull/6152). We release snapd every 4-6 weeks and people usually get updates reasonably quickly, so hopefully this is enough.

Hi, I’m glad to see discussion starting over this topic! Licenses do evolve and we should be ready for them to change.

The commit history at https://github.com/spdx/license-list-data/commits/master/json/licenses.json suggests updates land anywhere from every few days to every couple of weeks. I didn’t check that all of the commits are indeed license updates, but I don’t see why they’d update licenses.json if it wasn’t for a worthwhile change.