Snapd 2.37 breaks existing snap installation

Hey,

yesterday our CI boxes refreshed to the latest snapd, 2.37. We have two snaps installed that are important for building our software: go and snapcraft.

Everything is built from Jenkins as user jenkins, whose HOME directory is /var/lib/jenkins. Since today, various commands fail with the following error:

cannot create user data directory: /var/lib/jenkins/snap/go/3129: Permission denied

Looking into the system log reveals:

[765251.054724] audit: type=1400 audit(1548746770.146:4423): apparmor="DENIED" operation="open" profile="/snap/core/6259/usr/lib/snapd/snap-confine" name="/var/" pid=35086 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=112 ouid=0
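
For anyone wanting to check whether they are hitting the same denial, here is a minimal sketch that filters kernel log lines down to AppArmor denials involving snap-confine. The function name snap_confine_denials is made up for illustration, and it assumes the log lines have the same field layout as the one above.

```shell
# Hypothetical helper: read kernel log lines on stdin, keep only AppArmor
# denials issued for the snap-confine command, and print the denied path
# followed by the requested access mask.
snap_confine_denials() {
    grep 'apparmor="DENIED"' \
        | grep 'comm="snap-confine"' \
        | sed -n 's/.* name="\([^"]*\)".* requested_mask="\([^"]*\)".*/\1 \2/p'
}

# Typical use (reading the kernel log may need root on some systems):
#   dmesg | snap_confine_denials
```

Fed the log line above, this prints the denied path /var/ and the mask r, which points at snap-confine being unable to read /var/ rather than at the permissions of the snap data directory itself.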

The same problem appears for the snapcraft snap when it tries to build snaps with snapcraft cleanbuild:

cannot create user data directory: /var/lib/jenkins/snap/snapcraft/2374: Permission denied

The permissions of the directories are all set up correctly:

$ ls -alh /var/lib/jenkins/snap/go/
total 32K
drwxr-xr-x 8 jenkins jenkins 4.0K Jan 29 06:45 .
drwxr-xr-x 6 jenkins jenkins 4.0K Jan 25 06:38 ..
drwxr-xr-x 2 jenkins jenkins 4.0K Sep  7 17:23 2635
drwxr-xr-x 2 jenkins jenkins 4.0K Dec  5 12:38 3039
drwxr-xr-x 2 jenkins jenkins 4.0K Dec 14 07:01 3080
drwxr-xr-x 2 jenkins jenkins 4.0K Dec 17 17:31 3095
drwxr-xr-x 2 jenkins jenkins 4.0K Jan 29 06:45 3129
drwxr-xr-x 2 jenkins jenkins 4.0K Sep  7 17:23 common
lrwxrwxrwx 1 jenkins jenkins    4 Jan 25 00:50 current -> 3129

Any idea what is wrong here?

regards,
Simon

Thanks for your report. The technical reason for this issue is that we removed the “quirks” handling in the snap-confine code. This code used to require that we allow snap-confine to write to /var/lib/. So this worked for you by luck. Now of course if things break for you that is bad and we need to look into this.

We could restore the old apparmor profile for snap-confine - this would give it more permissions than it should have. Not sure if @jdstrand would like this.

@mvo Thanks for the quick help!

I will revert to the older core snap now.

We are fixing this as part of https://github.com/snapcore/snapd/pull/6446 - the commit message there explains what happened in detail.

Thanks for the PR. Our CI is currently blocked from doing charm builds/testing.

Now that this PR is merged, could I possibly get an ETA on when it will be released to edge? We are currently blocked on all Kubernetes CI testing.

Sorry for the trouble @adam.stokes! The fix is now in the “candidate” channel for core. Could you please snap refresh --candidate core and let us know if that fixes things for you?

Same for @morphis - The fix is now in the “candidate” channel for core. Could you please snap refresh --candidate core and let us know if that fixes things for you?

We would also like to encourage people to run candidate in some of their deployments if possible. We do a lot of testing before updates hit candidate, so generally they are good - and as this incident shows, sometimes even stable has regressions. Having more real-world workloads on candidate would help minimize those (but I understand of course that it's tricky and everyone is wary of the risk).
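
To see which channel core is currently tracking before or after switching, something like the following should work. This is only a sketch: the helper name tracking_channel is hypothetical, and it assumes the output of snap info contains a line beginning with "tracking:", which may differ between snapd versions.

```shell
# Hypothetical helper: read `snap info <snap>` output on stdin and print
# the value of the "tracking:" line (assumed to name the tracked channel).
tracking_channel() {
    awk '$1 == "tracking:" { print $2 }'
}

# Typical use:
#   snap info core | tracking_channel
# To go back to stable later:
#   sudo snap refresh --stable core
```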

@mvo This works for us! We’ll keep running core in candidate mode to help test. You guys are really great about turning around critical fixes.

Thanks again

This is handled in

https://github.com/snapcore/snapd/pull/6446

Does this require special cases to fix?

The wal-e snap is also broken, breaking all PostgreSQL charm deployments configured to use WAL archiving for disaster recovery. As of the automatic update to 2.37, WAL archiving stopped working, and disk partitions started filling. SREs have been reverting the core update to firefight, but this is obviously not a long-term fix. Affected systems include the snap store and related databases.

wal-e is a classic snap, generally run as the ‘postgres’ user with $HOME set to /var/lib/postgresql.

Thanks for reporting this - we will fix this ASAP. The background here is that we had /var/lib/ writeable for snap-confine. Because postgresql is using /var/lib/postgresql we broke that too :confused: Sorry! Technically this is a bugfix but of course breaking existing software is unacceptable so we will find a solution for this one too (either via special case or by reverting the entire fix).

Why can’t classic snaps generally write to /var/lib?

That is the right question to ask :slight_smile: The culprit is snap-confine which is very restricted to protect systems. But here it is protecting too much.

Yes, the issue here is not the confinement of the snap, but the confinement of snapd's own trampolines, and the fact that these home directories are not under /home/*.
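
To get a rough idea of which accounts on a machine could be affected, one could list users whose home directory is not under /home/. This is a sketch only: homes_outside_home is a made-up name, it merely inspects the passwd-format home field, and it will also list many system accounts that never run snaps.

```shell
# Hypothetical helper: read passwd(5)-format lines on stdin and print
# "user home" for every account whose home directory is not under /home/.
homes_outside_home() {
    awk -F: '$6 !~ /^\/home\// { print $1, $6 }'
}

# Typical use:
#   getent passwd | homes_outside_home
```

Accounts such as jenkins (/var/lib/jenkins) or postgres (/var/lib/postgresql) would show up in this list, which matches the cases reported in this thread.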

Hi,

for some days now my wekan instance (installed via snap) has not been working, and all snap commands show the following output:

https://github.com/wekan/wekan-snap/issues/76#issuecomment-459972925

Does anyone have an idea whether this might be a snap bug?

@napcraftiojugendbew I moved your topic into this one, but hopefully your question is answered by it directly. If not please speak up.

I tried to update the core via

deployer@jenkins:~$ sudo snap refresh --candidate core
core (candidate) 16-2.37.1 from Canonical✓ refreshed

but it still doesn’t work?

deployer@jenkins:~$ sudo snap refresh
error: cannot perform the following tasks:
- Run post-refresh hook of "wekan" snap if present (run hook "post-refresh": cannot perform operation: mount --rbind /var/lib/jenkins /tmp/snap.rootfs_HZ7xsB//var/lib/jenkins: Permission denied)

Hints appreciated! :confused:

The following worked for me:

I had to downgrade to 2.36.3; afterwards I could refresh my wekan packages again.

deployer@jenkins:~$ sudo snap revert --revision=6130 core
core reverted to 16-2.36.3

deployer@jenkins:~$ sudo snap refresh wekan
wekan 2.17 from Lauri Ojansivu (xet7) refreshed

The current release 16-2.37 doesn’t work for me and shows the error message included in the GitHub issue above.