I’m investigating the odd errors that some people reported about snap-confine being unable to mkdir the /run/snapd/ns directory. While we had many theories about what the problem might be (and one of them turned out to be true: snapd running under LXD with apparmor stacking/confinement disabled), there seems to be more going on still:
I patched my apparmor installation to trace all calls to apparmor_parser and discovered this log after a reboot:
In this log you will find two interesting entries: --cache-loc=/etc/apparmor.d/cache and --cache-loc=/var/cache/apparmor
So it seems that we use two locations for the cache, depending on who’s calling. Quick inspection of those directories shows that there is no intersection between them (for whatever reason).
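Roughly, the two kinds of invocations in the trace look like this (the flags are the ones visible in the log; the profile paths here are just examples):

```sh
# apparmor boot init: system profiles, cached under /etc/apparmor.d/cache
apparmor_parser --write-cache --cache-loc=/etc/apparmor.d/cache -r /etc/apparmor.d/usr.lib.snapd.snap-confine

# snapd: snap profiles, cached under /var/cache/apparmor
apparmor_parser --write-cache --cache-loc=/var/cache/apparmor -r /var/lib/snapd/apparmor/profiles/snap.hello-world.hello
```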
This does beg the question, though: why? @jdstrand can you shed some light on this? Should we unify under one specific directory? EDIT: and one more question, how is that cache pruned of stale entries?
/etc/apparmor.d/cache should be considered owned by apparmor. /etc/apparmor.d is the location for so-called ‘system profiles’ where users and applications can put profiles that apparmor will manage (ie, apparmor init will write out cache files for anything in /etc/apparmor.d to /etc/apparmor.d/cache).
Some applications (eg libvirt, snappy, etc) choose to manage their own profiles, so they store their profiles somewhere else (eg, /var/lib/snapd/apparmor/profiles on snappy).
Ubuntu’s apparmor init is patched to understand /var/lib/snapd/apparmor/profiles and will load cache files from /var/cache/apparmor for those profiles (this concept dates back to Touch and snappy v1, though they of course used different profile directories). Profile caches for snappy end up there and apparmor init makes sure these are loaded on early boot.
With all of the recent changes with snap-reexec and deterministic profile loads, it may make sense to reconsider using /var/cache/apparmor as snappy’s cache dir and let snappy manage its own cache dir itself (eg, /var/lib/snapd/apparmor/cache). The main points to consider are:
cache files appreciably speed up profile load times because the profile doesn’t have to be compiled (especially on armhf)
the profile must be loaded into the kernel before the application starts (eg, snap-confine or snap commands). If it is not loaded into the kernel first, snap-confine will run unconfined because its profile uses binary attachment, and snap commands will fail to start because snap-confine won’t be able to run them when their profile is not found (a rough check is sketched below).
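As a rough way to see the second point, one can ask the kernel whether the snap-confine profile is already loaded before anything executes it (the attachment path below is just an example and differs between setups):

```sh
# profiles currently known to the kernel
grep '^/usr/lib/snapd/snap-confine ' /sys/kernel/security/apparmor/profiles \
    || echo "snap-confine profile not loaded yet; it would run unconfined"
```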
Notice in your trace that the first two lines (ie, where --write-cache is listed as the first argument) are coming from apparmor boot init where the others are snapd.
Nice, at least we’re using a more reasonable place for them.
Now, we need to find out why on certain occasions snap-confine is getting the wrong profile loaded. That one profile does end up in /etc/apparmor.d because it’s provided by the package.
Perhaps, but it is a very old decision that makes things convenient for distros and administrators who just want to plop a file in /etc/apparmor.d and then have the system DTRT (ie, load it on boot, enable caching, etc). Applications that have their own ideas about profile loads are under no obligation to use /etc/apparmor.d.
This was initially the correct choice for ubuntu-core-launcher and early versions of snap-confine since it was installed as part of the system, there was only one profile on disk, and the apparmor init boot process would figure out when to invalidate or use the cache easily enough. With all the re-exec work though, it may make sense to move snap-confine’s profile somewhere else and have snapd (or something else) load it. This could ensure that the correct profile is loaded and would avoid any distro-specific issues regarding when the apparmor initscript/systemd unit runs relative to snapd (distros should have apparmor load extremely early, of course), but it comes with its own set of challenges in making sure the profile is loaded before anything runs snap-confine.
Sorry for not being clear. It’s the caching in that place that sounds awkward. /etc is not a place to write content that is not intended to be manipulated by people. It sounds fine to have /etc/apparmor.d itself as a place for security profiles.
Either way, you are right, that’s not a problem we need to solve in this context.
snap-confine is the first thing that executes in many cases. We can’t have snapd itself loading the profile for it on boot, and it seems to make sense that this is done by something from the system itself as it’s a general problem if one intends to use apparmor at all.
Oh yes, apparmor upstream has regretted that location for some time, but it is at least a decade-old decision at this point and there is a lot of inertia behind it. :\
Yeah, that is why it wasn’t moved yet: it isn’t a bad place for it to be now, and if it were moved we would need another systemd unit/etc to make sure it was loaded, which begs the question of why we are duplicating that work. I know there was some talk of moving the profile out of /etc/apparmor.d for other reasons, so I thought I’d mention it.
I wanted to open a new thread about this but then I noticed we already have one here:
@ogra and I just debugged this issue and we believe we have an explanation:
There are two separate issues here, well, maybe three if you include the behavior of the apparmor caching layer:
The kernel doesn’t write timestamps (last write time, last mount time) to the ext4 superblock if the filesystem is remounted read-only before being finally unmounted.
This is what causes our ext4 superblock timestamps to effectively stay frozen in time. When the system shuts down, / is remounted read-only and then unmounted. This means we don’t write those timestamps (just the ones in the superblock).
On a device like the Raspberry Pi, which doesn’t have a battery-backed RTC, we use the last mount time as the initial time. Apart from the fact that this is a constant that doesn’t change in our images, this has another issue: it is actually older than the last write time (once that timestamp functions correctly) and it may put the system “into the past” with regard to the timestamps on the apparmor profile source file and the apparmor profile cache file. Depending on how the cache works this may cause us to load the cache even when it is stale. The fixrtc script should use the last write time, falling back to the last mount time if the former is unset (see the sketch after this list).
The cache system should be content-based, not date-based, but we already know that.
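To make the fixrtc point above concrete, here is a minimal sketch of what an adjusted step could look like, assuming dumpe2fs and GNU date are available in the initramfs (the device path is just an example; the real script would discover it):

```sh
#!/bin/sh
# sketch only: prefer the superblock's "Last write time", fall back to "Last mount time"
ROOT_DEV=/dev/mmcblk0p2

write_time="$(dumpe2fs -h "$ROOT_DEV" 2>/dev/null | sed -n 's/^Last write time: *//p')"
mount_time="$(dumpe2fs -h "$ROOT_DEV" 2>/dev/null | sed -n 's/^Last mount time: *//p')"

if [ -n "$write_time" ] && [ "$write_time" != "n/a" ]; then
    date -s "$write_time"
elif [ -n "$mount_time" ] && [ "$mount_time" != "n/a" ]; then
    date -s "$mount_time"
fi
```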
This was tested on a Fedora 29 VM with a loopback device and ext4: remounting the filesystem read-only before unmounting it results in wrong data in the superblock (timestamps are not updated to reflect reality, eg files written to disk are more recent than the “last write time”). It was also tested on a Raspberry Pi running Ubuntu Core and a 4.4 kernel, where loading an apparmor profile without considering the cache resulted in correct behavior, but rebooting the system loaded a stale cache entry. It was also tested there by observing the output of “dumpe2fs” on the root filesystem, which shows that the system perpetually believes it is the moment the image was first created (when the filesystem was formatted).
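For reference, the loopback test can be reproduced roughly like this (all paths are illustrative):

```sh
# create and mount a small ext4 image
dd if=/dev/zero of=/tmp/test.img bs=1M count=64
mkfs.ext4 -F -q /tmp/test.img
mkdir -p /mnt/test
mount -o loop /tmp/test.img /mnt/test

# write something, then remount read-only before unmounting (mimicking shutdown)
touch /mnt/test/some-file
mount -o remount,ro /mnt/test
umount /mnt/test

# the superblock timestamps lag behind the actual writes
dumpe2fs -h /tmp/test.img | grep -E 'Last (write|mount) time'
```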
So, some more thoughts on this bug. On a Raspberry Pi we have no clock, so on first boot we just set the time to the moment when the ext4 on /writable was first created. Then we mount / and carry on. This sets the last-mount-time of / to a few seconds after the image was created. Then we reboot and we set the time to the last-mount-time of /, which is really very close to the time when the system was created… see the problem? This will continue forever, “slowly” ticking seconds forward with each boot. This explains the pastebin above, where the time is roughly 12 minutes after the moment when the filesystem was created (after about two weeks of real time and 18 reboot cycles).
I’m looking into apparmor caching logic now. I will update this answer when I have more data.
The fixrtc logic is clearly wrong. It was stated elsewhere that on these devices the clock can only be reasonably expected to be accurate after network time is fetched. Why not simply adjust so that after the time is fetched, that date is stored off for fixrtc to use on the next boot?
The original purpose (back in 2011/12) was to prevent mount failures due to “last mount time is in the future” by forcing the clock to the filesystem creation (or if available, “last mount”) date.
See above… to read any “stored off” data you would have to have a filesystem mounted; the fixrtc script was originally designed to fix a mount problem, so it runs prior to mounting any filesystem.
@abeato has worked on a post-mount enhancement for it at Ubuntu Pastebin that could be merged (though it is very snap-centric and would have to go into initramfs-tools-ubuntu-core instead of initramfs-tools itself, or would have to be made more generic).
Alternatively there is a “fake-hwclock” package in Raspbian that could be pulled into core and used (I think it stores a timestamp on disk and simply gets started from an early systemd unit).
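If I remember correctly, the mechanism is essentially this (simplified sketch; the real package uses /etc/fake-hwclock.data and ships the systemd/cron glue around it):

```sh
# at shutdown (and periodically): save the current time
date -u '+%Y-%m-%d %H:%M:%S' > /etc/fake-hwclock.data

# early at boot: restore it, so the clock is at least monotonic across reboots
date -u -s "$(cat /etc/fake-hwclock.data)"
```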
I was perhaps a bit short in my response. What I meant is that, with the systemd remount-ro behavior involved, it is wrong to rely on these filesystem timestamps to set something even remotely close to the actual time.
Right, this is why I propose to use Alfonso’s fixrtc enhancement: it simply does a “stat -c %Y /var/lib/snapd/snaps”, which will get you a pretty accurate date, namely the modification time of that folder relative to the epoch (this should definitely fix the apparmor cache issue, given that the profile file comes with a snap that was installed in this folder).
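The gist of that enhancement is something like the following, run after /writable is mounted (this is only a sketch; the forward-only guard is my addition, the actual code is in the pastebin):

```sh
# never let the clock sit earlier than the newest modification of the snaps folder
snaps_mtime="$(stat -c %Y /var/lib/snapd/snaps 2>/dev/null)"

if [ -n "$snaps_mtime" ] && [ "$(date +%s)" -lt "$snaps_mtime" ]; then
    date -s "@$snaps_mtime"
fi
```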
I now believe I fully understand the situation in which a de facto stale cache is used.
It is 2018; a core snap update triggers a system reboot.
It is 2016 now, because we don’t have a clock across reboots and fixrtc is broken.
System startup runs apparmor.service, which loads cached profiles because cache.mtime >= source.mtime; those profiles refer to snapctl at /usr/bin/snapctl.
Service startup triggers system key regeneration; all profiles are re-written, now referring to snapctl at /usr/lib/snapd/snapctl.
apparmor_parser is used to load the changed profiles but, because the cache files are from 2018, they are used without referencing the updated source files (which date to 2016 since we don’t have good time yet).
NTP syncs the time; it is 2018 again.
Manually loading the profile (normally the cache is not enabled then, so that part is avoided) “fixes” the problem ogra saw.
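For completeness, the manual load that side-steps the cache is roughly this (the profile path is just an example; passing --skip-cache makes the behavior explicit):

```sh
# reload the profile, ignoring any cached binary
apparmor_parser --skip-cache -r /var/lib/snapd/apparmor/profiles/snap.hello-world.hello
```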
I will prepare a minimal patch that only uses cache files if the mtime matches exactly. This is correct since the mtime of a cache file is synchronized to the source file’s mtime when a cache entry is written.
I’ve been thinking about how to solve this issue, knowing that there are multiple source files due to the include mechanism. I think we can keep the existing check (the mtime of all the source files must be < the mtime of the cache) and then add an additional check where we also verify that the mtime of the first source file is identical to the mtime of the cache. Since the mtime of the cache is set to the mtime of the source, this feels correct.
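Roughly, the check I have in mind, expressed with plain stat for clarity (the paths and the include list are just examples; the real logic lives in the cache-validation code):

```sh
profile=/var/lib/snapd/apparmor/profiles/snap.hello-world.hello   # example
cache=/var/cache/apparmor/snap.hello-world.hello                  # example
includes="/etc/apparmor.d/abstractions/base"                      # example

cache_mtime="$(stat -c %Y "$cache")"
profile_mtime="$(stat -c %Y "$profile")"

# existing check: no source file may be newer than the cache
stale=0
for src in "$profile" $includes; do
    [ "$(stat -c %Y "$src")" -gt "$cache_mtime" ] && stale=1
done

# additional check: the first source file's mtime must match the cache exactly
[ "$profile_mtime" -ne "$cache_mtime" ] && stale=1

if [ "$stale" -eq 0 ]; then
    echo "cache usable"
else
    echo "recompile the profile"
fi
```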
That won’t work in all cases (though it would for many) since the parser will actually look at and use the newest mtime between the profile and any of its include files. Eg, if the base abstraction is newer than the profile, its mtime will be used. This might not be a problem in practice, but you would need to think it through.