NVMe block-devices support?

I’m trying to make a snap of nvme-cli, but it’s not quite working and I don’t understand why.

My snap uses the block-devices interface, but I cannot access the block device using nvme-cli; I keep getting EPERM. Specifically, I am running this command:

sudo nvme-cli id-ctrl /dev/nvme0n1

which tries to do this:

openat(AT_FDCWD, "/dev/nvme0n1", O_RDONLY) = -1 EPERM (Operation not permitted)

which is odd to me because, looking at the major/minor for that device, it is a block device, and the devices cgroup even says it should be allowed:

$ udevadm info --query property --name /dev/nvme0n1
DEVPATH=/devices/pci0000:40/0000:40:01.1/0000:41:00.0/nvme/nvme0/nvme0n1
DEVNAME=/dev/nvme0n1
DEVTYPE=disk
MAJOR=259
MINOR=5
SUBSYSTEM=block
USEC_INITIALIZED=3872242
... other things
TAGS=:systemd:snap_nvme-cli_nvme-cli:
$ sudo cat /sys/fs/cgroup/devices/snap.nvme-cli.nvme-cli/devices.list | grep 259:5
c 259:5 rwm

I confirmed that I have block-devices connected:

$ snap connections nvme-cli 
Interface         Plug                       Slot               Notes
block-devices     nvme-cli:block-devices     :block-devices     manual
hardware-observe  nvme-cli:hardware-observe  :hardware-observe  manual
home              nvme-cli:home              :home              -
network-control   nvme-cli:network-control   -                  -
system-observe    nvme-cli:system-observe    :system-observe    manual

I will note that the AppArmor rule in block-devices for NVMe specifically doesn’t cover namespace device paths like my nvme0n1, and instead only seems to support e.g. nvme0. But even when I add that path to the AppArmor profile it still fails the same way, and I don’t see any denials from AppArmor, which leads me to believe it is the devices cgroup denying access.

Finally, as an aside, is the block-devices interface a good place to add support for the nvme0 devices directly? I assume so, because the AppArmor policy was written to support this, but the /dev/nvme0 device on my focal system doesn’t show up as a block subsystem device; it shows up as an nvme subsystem device:

$ udevadm info --query property --name /dev/nvme0
DEVPATH=/devices/pci0000:40/0000:40:01.1/0000:41:00.0/nvme/nvme0
DEVNAME=/dev/nvme0
NVME_TRTYPE=pcie
MAJOR=239
MINOR=0
SUBSYSTEM=nvme
... other things

Ping @jdstrand when you have some time.

P.S. In case anyone goes looking, here is some relevant code in the Linux kernel for where the format of /dev/nvme0 vs. /dev/nvme0n1 comes from; AFAICT they are the same device, just with two different “handles”.

Since I already wrote this code locally in snapd, I just proposed it anyway and marked it blocked until we can sort out why this isn’t working with /dev/nvme0n1 paths. Note that with the above PR I can use nvme-cli id-ctrl /dev/nvme0, since I added the SUBSYSTEM=="nvme" udev rule there too.

The short answer is that there is a bug in snap-device-helper that adds the device with c 259:0 rwm rather than b 259:0 rwm, and the fix is to change this:

if [ "${DEVPATH#*/block/}" != "$DEVPATH" ]; then
    type="b"
else
    type="c"
fi

to be:

if [ "${DEVPATH#*/block/}" != "$DEVPATH" ]; then
    type="b"
elif [ "${DEVPATH#*/nvme/}" != "$DEVPATH" ]; then
    type="b"
else
    type="c"
fi
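
For anyone unfamiliar with the shell idiom used here: ${DEVPATH#*/block/} strips the shortest leading match of */block/, so the result only differs from $DEVPATH when the path actually contains /block/. A quick illustration using the namespace DEVPATH from the question:

$ DEVPATH=/devices/pci0000:40/0000:40:01.1/0000:41:00.0/nvme/nvme0/nvme0n1
$ echo "${DEVPATH#*/nvme/}"     # pattern matches, prefix stripped -> differs
nvme0/nvme0n1
$ echo "${DEVPATH#*/block/}"    # no /block/ in the path -> unchanged
/devices/pci0000:40/0000:40:01.1/0000:41:00.0/nvme/nvme0/nvme0n1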

</tldr>

I’m going to detail my process in case the techniques are useful to others, since working with the device cgroup is not always intuitive. Thankfully, I had a machine with an NVMe drive, which made things straightforward.

I created a small test snap called ‘test-nvme’, had it plug block-devices, then installed it and connected the interface. I installed the nvme-cli deb on my system, ran ldd /usr/sbin/nvme, and saw that it only linked to glibc, so I just copied it into SNAP_COMMON of test-nvme and confirmed the bug with:

$ sudo /usr/sbin/nvme list # unconfined, deb binary works
Node ...
$ sudo cp /usr/sbin/nvme /var/snap/test-nvme/common/
$ sudo snap run test-nvme.sh -c '$SNAP_COMMON/nvme list'
Failed to open /dev/nvme0n1: Operation not permitted
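
For reference, a minimal sketch of what such a test snap’s snapcraft.yaml could look like; everything except the snap name, the app name and the block-devices plug is hypothetical filler, and the part that ships the sh wrapper is elided:

$ cat snapcraft.yaml
name: test-nvme
version: '0'                   # hypothetical
summary: scratch snap for debugging block-devices
description: scratch snap for debugging block-devices
grade: devel
confinement: strict

apps:
  sh:
    command: bin/sh-wrapper    # hypothetical wrapper that exec's /bin/sh "$@"
    plugs:
      - block-devices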

Nothing in the logs about that failure, so I strace’d with:

$ sudo snap run --strace="-o ./trace" test-nvme.sh -c '$SNAP_COMMON/nvme list'
$ cat ./trace
...
411687 stat("/dev/nvme0n1p3", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0x3), ...}) = 0
411687 stat("/dev/nvme0n1p2", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0x2), ...}) = 0
411687 stat("/dev/nvme0n1p1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0x1), ...}) = 0
411687 stat("/dev/nvme0n1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x103, 0), ...}) = 0
411687 stat("/dev/nvme0", {st_mode=S_IFCHR|0600, st_rdev=makedev(0xf0, 0), ...}) = 0
411687 getdents(3, /* 0 entries */, 32768) = 0
411687 close(3)                         = 0
411687 open("/dev/nvme0n1", O_RDONLY)   = -1 EPERM (Operation not permitted)
...

Denied accesses under UNIX permissions (aka Discretionary Access Control, aka DAC) return EPERM, and the kernel uses the same errno for device cgroup denials. This, combined with the lack of logs, made me think it was the device cgroup.

I wanted to rule out AppArmor completely though, in case there was a logging bug or something, so I executed the binary under just the AppArmor profile with:

$ sudo aa-exec -p snap.test-nvme.sh /var/snap/test-nvme/common/nvme list
Failed to open /dev/nvme0n1: Permission denied

Right, this matches what @ijohnson said: there is a bug in the AppArmor policy in that it doesn’t allow the NVMe namespace devices. So I added the following to the AppArmor profile, reloaded it into the kernel, and tried again:

$ tail -2 /var/lib/snapd/apparmor/profiles/snap.test-nvme.sh
/dev/nvme{[0-9],[1-9][0-9]}n{[1-9],[1-5][0-9],6[0-3]} rw, # NVMe (up to 100 devices, with 1-63 namespaces)
...
$ sudo apparmor_parser -r /var/lib/snapd/apparmor/profiles/snap.test-nvme.sh
$ sudo aa-exec -p snap.test-nvme.sh /var/snap/test-nvme/common/nvme list
Node  ...
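
For reference, the AppArmor alternation and character-class globs in that rule expand exactly as the trailing comment says:

# nvme{[0-9],[1-9][0-9]}       -> nvme0 .. nvme99                    (100 controllers)
# n{[1-9],[1-5][0-9],6[0-3]}   -> n1 .. n9, n10 .. n59, n60 .. n63   (namespaces 1-63)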

Ok, so all AppArmor needed was that rule, but we knew that from @ijohnson’s post. So I tried the snap itself again:

$ sudo snap run test-nvme.sh -c '$SNAP_COMMON/nvme list'
Failed to open /dev/nvme0n1: Operation not permitted

I then confirmed what @ijohnson found with udevadm info /dev/nvme0 and udevadm info /dev/nvme0n1, and by looking in /sys/fs/cgroup/devices/snap.test-nvme.sh/devices.list for the MAJOR and MINOR numbers. E.g.:

$ for d in /dev/nvme0* ; do ls -l $d ; udevadm info $d | grep -E '(MAJOR|MINOR|SUBSYSTEM|TAGS)' ; echo ; done ; grep -E '(240|259):' /sys/fs/cgroup/devices/snap.test-nvme.sh/devices.list 
crw------- 1 root root 240, 0 May 27 14:45 /dev/nvme0
E: MAJOR=240
E: MINOR=0
E: SUBSYSTEM=nvme

brw-rw---- 1 root disk 259, 0 May 27 14:45 /dev/nvme0n1
E: MAJOR=259
E: MINOR=0
E: SUBSYSTEM=block
E: TAGS=:snap_test-nvme_sh:systemd:

brw-rw---- 1 root disk 259, 1 May 27 14:45 /dev/nvme0n1p1
E: MAJOR=259
E: MINOR=1
E: SUBSYSTEM=block
E: TAGS=:snap_test-nvme_sh:systemd:

brw-rw---- 1 root disk 259, 2 May 27 14:45 /dev/nvme0n1p2
E: MAJOR=259
E: MINOR=2
E: SUBSYSTEM=block
E: TAGS=:snap_test-nvme_sh:systemd:

brw-rw---- 1 root disk 259, 3 May 27 14:45 /dev/nvme0n1p3
E: MAJOR=259
E: MINOR=3
E: SUBSYSTEM=block
E: TAGS=:systemd:snap_test-nvme_sh:

c 240:0 rwm
c 259:0 rwm
c 259:1 rwm
c 259:2 rwm
c 259:3 rwm

Since I gave the short answer above, ideally I would’ve stopped here, but the reality is that I didn’t format the output like the above (which makes it obvious), and, human nature being what it is, sometimes we see what we expect to see. That is, I didn’t notice that the device cgroup had character devices instead of block devices. I did notice that /dev/nvme0 was under the nvme subsystem and not block, and so it didn’t show up in the device cgroup (we’ll determine how to fix that in @ijohnson’s PR though).

(Aside: one could argue this has security impact since the wrong devices were exposed; however, our security-in-depth strategy means that access to those character devices was still disallowed by AppArmor. Not to mention, the block-devices interface is ‘super-privileged’ and needs an installation constraint, so most snaps can’t use it anyway.)

At this point, since I erroneously thought the cgroup was correctly configured, I started to wonder if there was a kernel bug or something else. The first thing I did was adjust /etc/udev/rules.d/70-snap.test-nvme.rules to also include SUBSYSTEM=="nvme", TAG+="snap_test-nvme_sh", then I ran sudo udevadm trigger --subsystem-match=nvme, ran udevadm again, and tested the snap: no difference.
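
Concretely, that adjustment was something like the following (the rule text is exactly as above; appending with tee is just one way to do it):

$ echo 'SUBSYSTEM=="nvme", TAG+="snap_test-nvme_sh"' | sudo tee -a /etc/udev/rules.d/70-snap.test-nvme.rules
$ sudo udevadm trigger --subsystem-match=nvme
$ udevadm info /dev/nvme0 | grep TAGS    # confirm the device picked up the tag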

So to rule out everything except the device cgroup, I confirmed in a root shell in one terminal that nvme works without any cgroup changes:

$ sudo -i bash
# echo $$
397998
# grep devices /proc/self/cgroup
2:devices:/user.slice
# /usr/sbin/nvme list
Node             SN...

In another terminal I added that shell to the snap’s device cgroup:

$ sudo sh -c 'echo 397998 > /sys/fs/cgroup/devices/snap.test-nvme.sh/cgroup.procs'

Now back to the root shell:

# grep devices /proc/self/cgroup
2:devices:/snap.test-nvme.sh
# /usr/sbin/nvme list
Failed to open /dev/nvme0n1: Operation not permitted

This confirms that the device cgroup is (at least part of) the problem since I isolated that part of the sandbox and applied it to an arbitrary root process.

I then wanted to rule out the snapd setup entirely by using a simple device cgroup of my own. So I created a default-deny cgroup and added the root shell to it:

$ sudo mkdir /sys/fs/cgroup/devices/test-nvme-cgroup
$ sudo sh -c "echo 'a *:* rwm' > /sys/fs/cgroup/devices/test-nvme-cgroup/devices.deny"
$ cat /sys/fs/cgroup/devices/test-nvme-cgroup/devices.list # returning empty means nothing allowed
$
$ sudo sh -c 'echo 397998 > /sys/fs/cgroup/devices/test-nvme-cgroup/cgroup.procs'

Then I went to the root shell:

# grep devices /proc/self/cgroup
2:devices:/test-nvme-cgroup
# /usr/sbin/nvme list
Failed to open /dev/nvme0n1: Operation not permitted

I then had the idea to add every block device on the system to the device cgroup to see if another access was needed, so I ran the loop below in the root shell that was already running in test-nvme-cgroup. (Aside: this is another reason why security in depth is important: while it was great that there was a device cgroup in effect, the root shell could modify it to its heart’s desire. In our case with snaps, AppArmor would’ve blocked that, but for our isolated debugging scenario it was just what we needed :slight_smile: )

# for i in /sys/dev/block/* ; do echo $i ; echo "b $(basename $i) rwm" > /sys/fs/cgroup/devices/test-nvme-cgroup/devices.allow ; /usr/sbin/nvme list && break ; done
...
/sys/dev/block/253:3
Failed to open /dev/nvme0n1: Operation not permitted
/sys/dev/block/259:0
Node             SN ...

In this case adding the device worked! I then checked the device cgroup:

# cat /sys/fs/cgroup/devices/test-nvme-cgroup/devices.list
b 253:0 rwm
b 253:1 rwm
b 253:2 rwm
b 253:3 rwm
b 259:0 rwm

It was at this point that it finally dawned on me that the snap’s device cgroup had c 259:0 rwm. After removing my hand from my forehead, I immediately looked in /usr/lib/snapd/snap-device-helper (I really should’ve looked at snap version, figured out if we re-exec, and then looked in /snap/snapd/current/usr/lib/snapd/snap-device-helper, but I happened to know that we hadn’t changed anything in this area :wink: ).

While it took me a bit longer to diagnose the issue than it should have, I decided to turn it into a debug-session walkthrough, so that hopefully others can learn how these pieces work together along with some debugging techniques.
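
P.S. for anyone replaying this at home: to clean up afterwards, move the root shell back to its original cgroup first, since a v1 cgroup directory can only be removed once it has no tasks in it. A sketch, assuming the PID and original cgroup (user.slice) from earlier:

$ sudo sh -c 'echo 397998 > /sys/fs/cgroup/devices/user.slice/cgroup.procs'
$ sudo rmdir /sys/fs/cgroup/devices/test-nvme-cgroup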

Great investigative work, Jamie! Thanks for the detailed post; I enjoyed reading some of your strategies for looking into this problem!

Hmm, unfortunately I don’t think your patch to snap-device-helper is quite right, because we really need to distinguish between a DEVPATH like this (which is for a block device):

/devices/pci0000:00/0000:00:1d.0/0000:04:00.0/nvme/nvme0/nvme0n1

and a DEVPATH like this (which is for the char device):

/devices/pci0000:00/0000:00:1d.0/0000:04:00.0/nvme/nvme0

I was able to accomplish this by replacing the check you have above with:

# check if it's a block or char dev
# TODO: re-write this to be more robust, the bash variable substitution done 
# here is quite awkward :-/
if [ "${DEVPATH#*/block/}" != "$DEVPATH" ] ; then
    type="b"
elif [ "${DEVPATH#*/nvme/}" != "$DEVPATH" ]; then
    # nvme subsystem devices can either be block or character devices, the
    # block devices have a more deeply nested path than the character devices
    NVME_SUBPATH="${DEVPATH#*/nvme*/}"
    if [ "${NVME_SUBPATH#*/}" != "$NVME_SUBPATH" ]; then
        # it is more deeply nested
        type="b"
    else
        type="c"
    fi
else
    type="c"
fi
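
To sanity-check the branching, here is a small standalone harness (hypothetical, not part of snap-device-helper) that wraps the check in a function and feeds it the two DEVPATHs from above, plus a partition path of the same shape:

#!/bin/sh
devtype() {
    DEVPATH="$1"
    if [ "${DEVPATH#*/block/}" != "$DEVPATH" ] ; then
        type="b"
    elif [ "${DEVPATH#*/nvme/}" != "$DEVPATH" ]; then
        # nvme block devices are more deeply nested than the char devices
        NVME_SUBPATH="${DEVPATH#*/nvme*/}"
        if [ "${NVME_SUBPATH#*/}" != "$NVME_SUBPATH" ]; then
            type="b"
        else
            type="c"
        fi
    else
        type="c"
    fi
    echo "$type $1"
}

devtype /devices/pci0000:00/0000:00:1d.0/0000:04:00.0/nvme/nvme0/nvme0n1             # b (namespace)
devtype /devices/pci0000:00/0000:00:1d.0/0000:04:00.0/nvme/nvme0                     # c (controller)
devtype /devices/pci0000:00/0000:00:1d.0/0000:04:00.0/nvme/nvme0/nvme0n1/nvme0n1p1   # b (partition)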

Good catch. I commented in the PR.