The short answer is that there is a bug in snap-device-helper that adds the device with c 259:0 rwm rather than b 259:0 rwm, and the fix is to change this:
if [ "${DEVPATH#*/block/}" != "$DEVPATH" ]; then
    type="b"
else
    type="c"
fi
to be:
if [ "${DEVPATH#*/block/}" != "$DEVPATH" ]; then
    type="b"
elif [ "${DEVPATH#*/nvme/}" != "$DEVPATH" ]; then
    type="b"
else
    type="c"
fi
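To see why the extra branch matters, here is a minimal sketch of the same test wrapped in a function. The function name devtype_for and the sample DEVPATH values are illustrative assumptions (typical-looking sysfs layouts), not taken from snap-device-helper itself.

```shell
#!/bin/sh
# devtype_for DEVPATH: hypothetical wrapper around the patched test.
# "${p#*/block/}" strips the shortest prefix ending in "/block/"; if nothing
# was stripped, the result equals the input, so the path has no "/block/" part.
devtype_for() {
    p="$1"
    if [ "${p#*/block/}" != "$p" ]; then
        echo b
    elif [ "${p#*/nvme/}" != "$p" ]; then
        echo b
    else
        echo c
    fi
}

# Sample paths are assumptions about how sysfs lays these devices out.
devtype_for /devices/pci0000:00/0000:00:17.0/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda1  # prints b
devtype_for /devices/pci0000:00/0000:00:1d.0/0000:3c:00.0/nvme/nvme0/nvme0n1                # prints b
devtype_for /devices/virtual/mem/null                                                       # prints c
```

Without the elif branch, the second path has no /block/ component and falls through to c.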
<tldr;>
I’m going to detail my process in case the techniques are useful to others since working with the device cgroup is not always intuitive. Thankfully, I had a device with an nvme drive which made things straightforward.
I created a small test snap called ‘test-nvme’, had it plug block-devices, then installed it and connected the interface. I installed the nvme-cli deb on my system, ran ldd /usr/sbin/nvme, saw that it only linked to glibc, so I just copied it into SNAP_COMMON of test-nvme and confirmed the bug with:
$ sudo /usr/sbin/nvme list # unconfined, deb binary works
Node ...
$ sudo cp /usr/sbin/nvme /var/snap/test-nvme/common/
$ sudo snap run test-nvme.sh -c '$SNAP_COMMON/nvme list'
Failed to open /dev/nvme0n1: Operation not permitted
Nothing in the logs, so strace’d with:
$ sudo snap run --strace="-o ./trace" test-nvme.sh -c '$SNAP_COMMON/nvme list'
$ cat ./trace
...
411687 stat("/dev/nvme0n1p3", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0x3), ...}) = 0
411687 stat("/dev/nvme0n1p2", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0x2), ...}) = 0
411687 stat("/dev/nvme0n1p1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0x1), ...}) = 0
411687 stat("/dev/nvme0n1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0), ...}) = 0
411687 stat("/dev/nvme0", {st_mode=S_IFCHR|0600, st_rdev=makedev(0xf0, 0), ...}) = 0
411687 getdents(3, /* 0 entries */, 32768) = 0
411687 close(3) = 0
411687 open("/dev/nvme0n1", O_RDONLY) = -1 EPERM (Operation not permitted)
...
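strace prints the device numbers in hex, while ls, udev and the device cgroup use decimal. A small helper converts between them; the name majmin is made up for this sketch, and it relies on printf accepting C-style hex constants for %d.

```shell
#!/bin/sh
# majmin MAJOR MINOR: print "major:minor" in decimal.
# Arguments may be hex (0x103) or decimal, since printf %d accepts both.
majmin() {
    printf '%d:%d\n' "$1" "$2"
}

majmin 0x103 0   # /dev/nvme0n1 from the trace, prints 259:0
majmin 0xf0 0    # /dev/nvme0 from the trace, prints 240:0
```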
UNIX permissions (aka Discretionary Access Controls, aka DAC) will return EPERM and the kernel uses this with device cgroup accesses as well. This combined with the lack of logs made me think it was the device cgroup.
I wanted to rule out AppArmor completely though in case there was a logging bug or something, so I executed the binary under just the apparmor profile with:
$ sudo aa-exec -p snap.test-nvme.sh /var/snap/test-nvme/common/nvme list
Failed to open /dev/nvme0n1: Permission denied
Right, this is what @ijohnson said: there is a bug in the apparmor policy that doesn’t allow the nvme namespaces. So I added the following to the apparmor profile, reloaded it into the kernel and tried again:
$ tail -2 /var/lib/snapd/apparmor/profiles/snap.test-nvme.sh
/dev/nvme{[0-9],[1-9][0-9]}n{[1-9],[1-5][0-9],6[0-3]} rw, # NVMe (up to 100 devices, with 1-63 namespaces)
...
$ sudo apparmor_parser -r /var/lib/snapd/apparmor/profiles/snap.test-nvme.sh
$ sudo aa-exec -p snap.test-nvme.sh /var/snap/test-nvme/common/nvme list
Node ...
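To get a feel for which device names that rule covers, the pattern can be approximated with shell case patterns. This is only a sketch: AppArmor globbing is not the same as shell globbing, the function name matches_nvme_ns is hypothetical, and the brace alternations are expanded by hand.

```shell
#!/bin/sh
# matches_nvme_ns PATH: succeed if PATH looks like /dev/nvme<0-99>n<1-63>,
# i.e. the set of names the AppArmor rule above is meant to cover.
matches_nvme_ns() {
    name="${1#/dev/}"
    case "$name" in
        nvme[0-9]n[1-9] | nvme[0-9]n[1-5][0-9] | nvme[0-9]n6[0-3] | \
        nvme[1-9][0-9]n[1-9] | nvme[1-9][0-9]n[1-5][0-9] | nvme[1-9][0-9]n6[0-3])
            return 0 ;;
    esac
    return 1
}

for d in /dev/nvme0 /dev/nvme0n1 /dev/nvme0n1p1 /dev/nvme12n63 /dev/nvme0n64; do
    if matches_nvme_ns "$d"; then
        echo "$d: matched"
    else
        echo "$d: not matched"
    fi
done
```

In this approximation only /dev/nvme0n1 and /dev/nvme12n63 match; the controller (/dev/nvme0) and the partition (/dev/nvme0n1p1) are not covered by this rule on its own.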
Ok, so all apparmor needed was that rule, but we knew that from @ijohnson’s post. So I tried the snap itself again:
$ sudo snap run test-nvme.sh -c '$SNAP_COMMON/nvme list'
Failed to open /dev/nvme0n1: Operation not permitted
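Note that the two failures so far report different errors: under aa-exec the message was ‘Permission denied’ (EACCES), while under snap run it is ‘Operation not permitted’ (EPERM). As a rough triage aid, that observation could be encoded as below. The function name denial_hint is made up, and the mapping is only a hint drawn from this session, since other kernel checks can return either errno.

```shell
#!/bin/sh
# denial_hint MESSAGE: guess which sandbox layer to look at first, based on
# the strerror() text in MESSAGE. A heuristic, not a definitive diagnosis.
denial_hint() {
    case "$1" in
        *"Operation not permitted"*) echo "EPERM: look at the device cgroup" ;;
        *"Permission denied"*)       echo "EACCES: look at AppArmor" ;;
        *)                           echo "unknown" ;;
    esac
}

denial_hint "Failed to open /dev/nvme0n1: Operation not permitted"
denial_hint "Failed to open /dev/nvme0n1: Permission denied"
```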
I then confirmed what @ijohnson found with udevadm info /dev/nvme0 and udevadm info /dev/nvme0n1 and looking in /sys/fs/cgroup/devices/snap.test-nvme.sh/devices.list for the MAJOR and MINOR devices. Eg:
$ for d in /dev/nvme0* ; do ls -l $d ; udevadm info $d | grep -E '(MAJOR|MINOR|SUBSYSTEM|TAGS)' ; echo ; done ; grep -E '(240|259):' /sys/fs/cgroup/devices/snap.test-nvme.sh/devices.list
crw------- 1 root root 240, 0 May 27 14:45 /dev/nvme0
E: MAJOR=240
E: MINOR=0
E: SUBSYSTEM=nvme
brw-rw---- 1 root disk 259, 0 May 27 14:45 /dev/nvme0n1
E: MAJOR=259
E: MINOR=0
E: SUBSYSTEM=block
E: TAGS=:snap_test-nvme_sh:systemd:
brw-rw---- 1 root disk 259, 1 May 27 14:45 /dev/nvme0n1p1
E: MAJOR=259
E: MINOR=1
E: SUBSYSTEM=block
E: TAGS=:snap_test-nvme_sh:systemd:
brw-rw---- 1 root disk 259, 2 May 27 14:45 /dev/nvme0n1p2
E: MAJOR=259
E: MINOR=2
E: SUBSYSTEM=block
E: TAGS=:snap_test-nvme_sh:systemd:
brw-rw---- 1 root disk 259, 3 May 27 14:45 /dev/nvme0n1p3
E: MAJOR=259
E: MINOR=3
E: SUBSYSTEM=block
E: TAGS=:systemd:snap_test-nvme_sh:
c 240:0 rwm
c 259:0 rwm
c 259:1 rwm
c 259:2 rwm
c 259:3 rwm
Since I gave the short answer above, ideally I would’ve stopped here, but the reality is that I didn’t format the output like the above (which makes it obvious) and, human nature being what it is, sometimes we see what we expect to see. I.e., I didn’t notice the device cgroup had character devices instead of block devices. I did notice that /dev/nvme0 was under the nvme subsystem and not block, and so it didn’t show up in the device cgroup (we’ll determine how to fix that in @ijohnson’s PR though).
(Aside: one could argue this has security impact since the wrong devices were exposed, however our security-in-depth strategy means that the character devices for these devices were disallowed by apparmor (not to mention, the block-devices interface is ‘super-privileged’ and needs an installation constraint so most snaps can’t use it anyway)).
At this point, since I was erroneously thinking that the cgroup was correctly configured, I started to wonder if there was a kernel bug or something else. The first thing I did was adjust /etc/udev/rules.d/70-snap.test-nvme.rules to also include SUBSYSTEM=="nvme", TAG+="snap_test-nvme_sh", then ran sudo udevadm trigger --subsystem-match=nvme, ran udevadm again and tested the snap: no difference.
So to rule out everything except the device cgroup, I confirmed in a root shell in one terminal that nvme works without any cgroup changes:
$ sudo -i bash
# echo $$
397998
# grep devices /proc/self/cgroup
2:devices:/user.slice
# /usr/sbin/nvme list
Node SN...
In another terminal I added that shell to the snap’s device cgroup:
$ sudo sh -c 'echo 397998 > /sys/fs/cgroup/devices/snap.test-nvme.sh/cgroup.procs'
Now back to the root shell:
# grep devices /proc/self/cgroup
2:devices:/snap.test-nvme.sh
# /usr/sbin/nvme list
Failed to open /dev/nvme0n1: Operation not permitted
This confirms that the device cgroup is (at least part of) the problem since I isolated that part of the sandbox and applied it to an arbitrary root process.
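When moving processes between cgroups like this, it helps to check membership for an arbitrary PID rather than only from inside the shell. A minimal sketch, assuming the cgroup v1 layout used throughout this post (the function name devices_cgroup is hypothetical):

```shell
#!/bin/sh
# devices_cgroup [FILE]: print the cgroup path of the "devices" controller
# from a cgroup v1 style /proc/PID/cgroup file (default: /proc/self/cgroup).
# Each line has the form "hierarchy-ID:controller-list:path".
devices_cgroup() {
    awk -F: '{
        n = split($2, ctl, ",")
        for (i = 1; i <= n; i++)
            if (ctl[i] == "devices") print $3
    }' "${1:-/proc/self/cgroup}"
}

# Example (PID taken from the session above):
#   devices_cgroup /proc/397998/cgroup
```

On a system without a v1 devices hierarchy this prints nothing, which is itself a useful signal that the steps here do not apply as written.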
I then wanted to rule out all the snapd setup and create a simple device cgroup. So I created a default deny cgroup and added the root shell to it:
$ sudo mkdir /sys/fs/cgroup/devices/test-nvme-cgroup
$ sudo sh -c "echo 'a *:* rwm' > /sys/fs/cgroup/devices/test-nvme-cgroup/devices.deny"
$ cat /sys/fs/cgroup/devices/test-nvme-cgroup/devices.list # returning empty means nothing allowed
$
$ sudo sh -c 'echo 397998 > /sys/fs/cgroup/devices/test-nvme-cgroup/cgroup.procs'
Then I went to the root shell:
# grep devices /proc/self/cgroup
2:devices:/test-nvme-cgroup
# /usr/sbin/nvme list
Failed to open /dev/nvme0n1: Operation not permitted
I then had the idea to add every block device on the system to the device cgroup to see if another access was needed. (Aside: this is another reason why security in depth is important. While it was great that there was a device cgroup in effect, the root shell can modify it to its heart’s desire. In our case with snaps, AppArmor would’ve blocked it, but for our isolated debugging scenario, it was just what we needed.) So, in the root shell already running in test-nvme-cgroup:
# for i in /sys/dev/block/* ; do echo $i ; echo "b $(basename $i) rwm" > /sys/fs/cgroup/devices/test-nvme-cgroup/devices.allow ; /usr/sbin/nvme list && break ; done
...
/sys/dev/block/253:3
Failed to open /dev/nvme0n1: Operation not permitted
/sys/dev/block/259:0
Node SN ...
In this case adding the device worked! I then checked the device cgroup:
# cat /sys/fs/cgroup/devices/test-nvme-cgroup/devices.list
b 253:0 rwm
b 253:1 rwm
b 253:2 rwm
b 253:3 rwm
b 259:0 rwm
It was at this point it finally dawned on me that the snap’s device cgroup had c 259:0 rwm. After removing my hand from my forehead, I immediately looked in /usr/lib/snapd/snap-device-helper (I should’ve really looked at snap version and figured out if we re-execed and looked in /snap/snapd/current/usr/lib/snapd/snap-device-helper, but I happened to know that we hadn’t changed anything in this area).
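In hindsight, a mechanical cross-check of devices.list against sysfs might have flagged the mismatch sooner than my eyes did. The following is a sketch under assumptions: the function name check_devlist is made up, it expects the cgroup v1 devices.list format, and it relies on /sys/dev having block/MAJ:MIN and char/MAJ:MIN entries.

```shell
#!/bin/sh
# check_devlist LISTFILE [SYSDEV]: for every "TYPE MAJ:MIN ACCESS" line in a
# devices.list, report entries whose type has no matching device in sysfs
# while a device of the other type with the same numbers does exist.
check_devlist() {
    sysdev="${2:-/sys/dev}"
    while read -r type majmin rest; do
        case "$type" in
            b) want=block; other=char ;;
            c) want=char; other=block ;;
            *) continue ;;   # skip "a" (all) entries and blank lines
        esac
        case "$majmin" in *"*"*) continue ;; esac   # skip wildcard entries
        if [ ! -e "$sysdev/$want/$majmin" ] && [ -e "$sysdev/$other/$majmin" ]; then
            echo "$type $majmin: listed as $want but sysfs has it under $other"
        fi
    done < "$1"
}

# Example:
#   check_devlist /sys/fs/cgroup/devices/snap.test-nvme.sh/devices.list
```

The check only complains when the listed type is absent and the other type is present, because block and character devices have separate number spaces and the same MAJ:MIN can legitimately exist in both.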
While it took me a bit longer to diagnose the issue than it should’ve, I decided to turn that into a debug session walkthrough so hopefully others can learn how things work together and pick up some debug techniques.