Core snap revert issues on core devices

mvo · October 19, 2017, 4:35pm

The CE QA team reported an issue with snap refresh core --edge and snap revert core back to stable. Right now this will cause the following error message when trying to run a snap (like test-snapd-tools):

cannot open current mount profile: /run/snapd/ns/snap.test-snapd-tools.fstab: Permission denied

It turns out there is a problem with the apparmor cache on reverts. It is using mtime of the /etc/apparmor.d/cache/usr.lib.snapd.snap-confine which will be more recent than the mtime of /etc/apparmor.d/usr.lib.snapd.snap-confine and therefore the reverted core will use the wong cached apparmor profile from edge. This was always wrong and now breaks because we dropped the rules about accessing the .fstab files which the latest snap-confine does no longer use but the version in stable still uses.

My preferred approach to fix this would be to make the apparmor cache smarter and use hashes or timestamps of the input files stored in the cache so that the cache can easily be invalidated. However I was made aware that this is some work so in the short term the suggestion from @jdstrand is that we should rm -rf /etc/apparmor.d/cache/* when the core snap changes on a core device.

niemeyer · October 19, 2017, 5:50pm

The behavior of apparmor seems reasonable in the sense that re-reading ever single profile to hash it reduces its efficacy as a cache.

It also sounds reasonable to clean the cache every time snapd detects that a full profile re-generation is due, and we already have logic for that in place for other reasons, so shouldn’t be too much work on our end.

@jdstrand What sort of care do we need to take for cleaning up the cache in a non-racy way?

mvo · October 19, 2017, 5:51pm

One approach discussed on irc was to hash the mtime of all input files instead of the content.

niemeyer · October 19, 2017, 6:03pm

That would be more reasonable indeed. Or even just copying the mtime of the source into the target cache file. That would be simple and pretty reliable.

mvo · October 19, 2017, 7:04pm

Writing the regression test for this is slightly more involved that I anticipated, the reason is that we currently modify the core snap itself do enable spread login. This needs to be ported to be done outside in the writeable space to make a test from a pristine stable core to a pristine edge core work.

jdstrand · October 19, 2017, 8:35pm

I think upon refresh/revert of the core snap, on non-classic, after the core snap is downloaded but before any other logic, just unlink any file in /etc/apparmor.d/cache with a corresponding file in /etc/apparmor.d. This is a safe operation since on reboot the cache files will be regenerated automatically. The idea is, just blow away the cache whenever the core snap is is refreshed/reverted and things will be fine.

As far as caching the hash of the mtimes of the files, this was discussed on IRC and there are other ways that may even be faster. I think we should work with upstream on this and not preempt them since getting the caching wrong can introduce weird bugs and we don’t want to negate the benefits of the cache. AIUI, this work is currently not scheduled for 18.04; I’ll let @tyhicks comment on its priority.

In the mean time, I feel it is completely reasonable to simply remove the cache files on revert/refresh of core. It would be a small, easy to understand change.

tyhicks · October 19, 2017, 9:00pm

There’s no priority right now because modifying apparmor_parser and the on-disk format of policy cache files isn’t currently planned. If the snapd team needs us to implement this, then we can figure out priority among the other apparmor work for snapd.

mvo · October 20, 2017, 7:24am

A possible fix is here https://github.com/snapcore/snapd/pull/4060 - I am also working on a improved spread test.