Stable core release delay retrospective

mvo · July 13, 2017, 2:09pm

Intro

This is a retrospective on why we have the current delay in releasing a new stable core snap.

What happened?

After the 2.24 release we had no stable core with 2.25. Right before the release of 2.25 to stable, the CE QA team found a bug when refreshing/reverting core from 2.24->2.25->2.24. The root cause of this bug is that 2.25 introduced new seccomp syntax and symbols for snap-confine. snap-confine takes the textual representation of the rules and transforms that text on the fly into binary bpf rules that the kernel understands. The new version 2.25 had symbols that the old 2.24 snap-confine could not understand. Due to a race condition in the security profile re-generation at startup, some system daemons (like network-manager) would be started with profiles in the new syntax, the old snap-confine would not understand them, and hence the daemon would fail to start. In the case of network-manager this is of course very bad.

The original plan was to just skip 2.25 and revert the new syntax/symbols in 2.26. This was done, however it turned out that the revert was incomplete. Due to the risk of regressions with further reverts, the approach of generating the security profiles was reconsidered (outcome in Versionized profiles).

As a short-term fix the seccomp profile approach was rewritten so that snapd writes the binary bpf code directly. This way snap-confine just needs to load that binary representation into the kernel, eliminating the previous tight coupling of the profile with snap-confine. This also removed a lot of (suid root) C code from snap-confine. When this was ready, CVE-2017-10600 happened and took a couple of days to sort out, delaying the snap-seccomp fix further.
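To make the coupling described above concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the rule syntax, the symbol names (including `AF_NEWFAMILY`), the "binary" format and all function names are hypothetical and are not snapd's or snap-confine's real code. It shows an old launcher failing on a profile written by a newer snapd, and a launcher that only loads a precompiled blob being unaffected.

```python
# Toy model only: syntax, symbols and the "binary" format are hypothetical.

class UnknownSymbolError(Exception):
    """Raised when a profile uses a symbol the compiler does not know."""


def compile_profile(text, known_symbols):
    """Translate a textual profile into a fake 'binary' blob.

    This stands in for the text -> bpf translation step.
    """
    compiled = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        for token in line.split():
            if token not in known_symbols:
                raise UnknownSymbolError(token)
            compiled.append(known_symbols[token])
    return bytes(compiled)


# Symbols known to the "old" and the "new" release (hypothetical values).
OLD_SYMBOLS = {"read": 0, "write": 1, "socket": 2, "AF_INET": 3}
NEW_SYMBOLS = dict(OLD_SYMBOLS, AF_NEWFAMILY=4)

# A profile generated by the new release, using a new symbol.
PROFILE_FROM_NEW_SNAPD = "read\nwrite\nsocket AF_NEWFAMILY\n"


def old_launcher_text(profile_text):
    """Old design: the launcher itself parses the text at start time."""
    return compile_profile(profile_text, OLD_SYMBOLS)


def load_precompiled(blob):
    """New design: the launcher only loads bytes and parses nothing."""
    return blob


if __name__ == "__main__":
    try:
        old_launcher_text(PROFILE_FROM_NEW_SNAPD)
    except UnknownSymbolError as err:
        print("old launcher cannot start the daemon, unknown symbol:", err)

    # The generator that wrote the profile also compiles it.
    blob = compile_profile(PROFILE_FROM_NEW_SNAPD, NEW_SYMBOLS)
    print("precompiled blob loads fine:", load_precompiled(blob))
```

The point of the second design is that the component that knows the syntax is also the one that compiles it, so the launcher's version no longer matters for parsing.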

What went wrong?

Fixing the revert bug clearly took too long. Instead of skipping 2.25 we should probably have reverted the 2.25 seccomp syntax/symbol changes and continued with 2.25.1. Back when it happened, it was expected that 2.26 would be right around the corner. However, it turns out this assumption was wrong.

What did we learn?

How to make reverts and security profile generation more robust (https://forum.snapcraft.io/t/versionized-profiles/ is a direct outcome of this)
Seccomp profile generation is much more robust now
Much improved core revert testing got added (including scenarios that test early boot daemons like bluez and network-manager)
Automated SRU validation tests added

What can we do better next time?

Never skip a release (like we did with 2.25), go back to it instead
Do more release candidate (-rc) versions of snapd and move to the final 2.XY only when we are really happy with it
Move core tests done in beta to edge

Timeline:

2017-04-11 release of 2.24 stable core
2017-04-28 tag of 2.25 in git, release into beta
2017-05-11 2.26 gets branched
2017-05-17 CE QA team finds regression in core revert (discussed in https://forum.snapcraft.io/t/snapd-2-25-blocked-because-of-revert-race-condition/)
2017-06-01 Revert snap-seccomp syntax additions in 2.26 branch, release new snapd into beta
2017-06-08 CE QA finds that bluez does not work after core revert, previous fix incomplete
2017-06-12 core team decides to make the seccomp code more robust in the revert case and works on the seccomp-bpf branch, along with general “system-key” improvements down the line (Versionized profiles)
2017-06-26 seccomp-bpf branch lands in edge
2017-06-27 CVE-2017-10600 keeps the team busy with fixing
2017-07-03 stable core based on 2.24+CVE-2017-10600 released
2017-07-05 core 2.26.8 with seccomp-bpf and fix for CVE-2017-10600 goes to beta
2017-07-06 error in trusty discovered with 2.26.8 (tsync bit setting)
2017-07-12 2.26.9 with fix for tsync bit error on trusty uploaded to beta