Nvidia CUDA on Ubuntu Core

zyga-snapd · June 19, 2018, 11:08am

Do you know where the firmware is located? It roughly depends on the place and the set of syscalls needed to achieve that.

mborzecki · June 19, 2018, 11:11am

These are the paths tried by that version of the kernel: https://elixir.bootlin.com/linux/v3.10.105/source/drivers/base/firmware_class.c#L262

static const char * const fw_path[] = {
	fw_path_para,
	"/lib/firmware/updates/" UTS_RELEASE,
	"/lib/firmware/updates",
	"/lib/firmware/" UTS_RELEASE,
	"/lib/firmware"
};
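
The search order above can be reproduced from a shell to see which candidate files exist in a given mount namespace. This is only a rough sketch: the function names and the example firmware file name are made up for illustration, and fw_path_para (the optional custom path) is left out.

```shell
#!/bin/sh
# Print the paths the kernel would try for a firmware file, in the same
# order as fw_path[] above. fw_path_para is deliberately omitted.
fw_candidates() {
    name=$1
    release=${2:-$(uname -r)}   # UTS_RELEASE; defaults to the running kernel
    for dir in \
        "/lib/firmware/updates/$release" \
        "/lib/firmware/updates" \
        "/lib/firmware/$release" \
        "/lib/firmware"
    do
        printf '%s/%s\n' "$dir" "$name"
    done
}

# Report which of those candidates exist in the current mount namespace.
fw_find() {
    fw_candidates "$@" | while IFS= read -r path; do
        if [ -e "$path" ]; then
            echo "found:   $path"
        else
            echo "missing: $path"
        fi
    done
}

# Hypothetical firmware name, for illustration only.
fw_find example.bin
```

Running this once on the host and once from a shell inside the snap's environment (for example via `snap run --shell`, if available) should show whether the two see different contents under /lib/firmware.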

zyga-snapd · June 19, 2018, 11:13am

Given that this is done on a core system, you will have a mount namespace with minimal changes (no pivot root, almost no bind mounts). I suspect this is a confinement problem more than a mount namespace problem.

mborzecki · June 19, 2018, 11:18am

@mbeneto is this on an Ubuntu Core system?

mborzecki · June 19, 2018, 11:25am

@mbeneto also please upload a strace -f of the standalone program.

mborzecki · June 19, 2018, 11:43am

@mbeneto a quick check to try after rebooting:

sudo /usr/lib/snapd/snap-discard-ns <your-snap>
sudo mount -o bind /lib/firmware /snap/core/current/lib/firmware
run the snap application
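
The same steps could be wrapped in a small re-runnable script along these lines. It is a sketch, not something taken from snapd: the function names, the APPLY switch and the default snap name are invented for illustration, and by default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the quick check above. Prints the commands by default;
# set APPLY=1 (and run as root) to actually execute them.
TARGET=/snap/core/current/lib/firmware

run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

firmware_workaround() {
    snap_name=$1
    # Discard the snap's preserved mount namespace so it is rebuilt on next run.
    run /usr/lib/snapd/snap-discard-ns "$snap_name"
    # Bind the host firmware directory over the core snap's copy,
    # unless something is already mounted there.
    if mountpoint -q "$TARGET" 2>/dev/null; then
        echo "$TARGET is already a mount point, skipping bind mount"
    else
        run mount -o bind /lib/firmware "$TARGET"
    fi
}

# "my-snap" is a placeholder; pass the real snap name as the first argument.
firmware_workaround "${1:-my-snap}"
```

A bind mount made this way does not survive a reboot, so the script would have to run again after each boot, before the snap application is started.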

ogra · June 19, 2018, 2:34pm

i asked after the team meeting and there is no jetson TK1 anywhere in the devices team atm, sorry …

mborzecki · June 19, 2018, 6:16pm

Thanks for asking, appreciated!

mbeneto · June 20, 2018, 4:29am

Thanks for your support @mborzecki, @zyga-snapd, @ogra.

Yes, trying that after rebooting it works properly! (here is the strace, just in case)

Answering the other questions:

This is not an Ubuntu Core system; it’s L4T with a custom kernel, as the board being used is based on the TK1 (basically the same but with added CAN interfaces).
Standalone strace

mborzecki · July 26, 2018, 8:36am

@zyga-snapd I think the issue still persists. Maybe we should consider mounting host’s /lib/firmware inside the mount namespace to cover for this. What’s your thought?

Not sure how to cover this with AppArmor properly though. Maybe @jdstrand has some suggestions.

zyga-snapd · July 26, 2018, 8:38am

We can try doing that, yeah. Could you make this a part of the opengl interface? That would be sufficient. Just use the mount specification please.

jdstrand · July 26, 2018, 12:59pm

I don’t have context for this. If applications need read-only access to files in /lib/firmware to work properly, we can add it. Ideally something like ‘/lib/firmware/nvidia/{,*} r,’ or ‘/lib/firmware/nv*.fw r,’. I don’t have nvidia hardware so don’t know the actual paths that should be used, but the idea is scope them to nvidia if possible. Happy to take a look in the PR.

All that said, I don’t think a snap should be allowed to actually load firmware. Reading files in /lib/firmware is one thing, but loading is another. I’m not sure otoh of the syscall/userspace mechanisms for this, but we should model a new firmware backend on the kmod backend for this if it’s needed so that snapd can load the firmware on the snap’s behalf. If the only reason the snap needs /lib/firmware is to load firmware it finds, then please consider a firmware backend and skip the snap-confine, snap-update-ns and apparmor changes.

mborzecki · July 27, 2018, 10:14am

It is my understanding that what happens is as follows:

application does an ioctl to /dev/nvhost-as-gpu
the ioctl triggers an operation in the driver that attempts to switch some operating mode of the device
the driver does request_firmware()
using the old udev helper mechanism, the helper is run in the same namespace as current, thus /lib/firmware is not populated with the right firmware files, no firmware is loaded, and the device does not function as expected

It’s not uncommon for the driver to attempt loading the firmware on first access to the device. In that case, if the access happens within a snap, we run into this problem. Also, in the case above there was no apparmor support in the kernel, so we did not get a chance to see how the helper was labelled.

I think it boils down to the question of whether we should mount /lib/firmware from the host on classic systems.

mbeneto · October 24, 2018, 9:14am

I’ve seen that a month ago there were some changes in the opengl interface. Because of that, and because we recently migrated to Ubuntu 16.04 as the host system, I decided to give this another try, which resulted in the same problems described months ago when accessing the CUDA device.

Like before, if I execute the binding hack suggested by @mborzecki in this post before launching the snap I can successfully run it under devmode confinement. We would like to install this snap in several devices from now on, but I have some concerns:

How bad a practice would it be to keep using the binding hack before launching the snap? Can this affect the performance of snapd in any way?
@jdstrand, would adding access to /dev/nvmap and /dev/nvhost* help to get rid of this problem? Are more changes required? In any case, we are willing to help with debugging if necessary.

kyrofa · October 30, 2018, 3:30pm

Hey @mborzecki, @jdstrand, where does this stand? Do we have a way forward?

mborzecki · October 31, 2018, 7:44am

I think we can extend the interface to allow accessing /dev/nvhost* and /dev/nvmap to get people started. There is also a PR extending the opengl interface with CUDA-related pieces, see here for details: https://github.com/snapcore/snapd/pull/5696/

As for the firmware, there were 2 mechanisms in use:

the old one, where there is a userspace helper for loading the firmware - this could possibly be mediated by a helper provided by snapd (IIRC there was another one using uevents, but I’m not sure at this point and this would need further research)
the new one, where the kernel attempts to load the firmware file directly; if that fails, there’s a fallback path to the old mechanism (if not disabled by kernel config)

The caveat of the firmware mechanism is that it’s reactive, i.e. action happens only when requested by the driver, and this could equally well be when the device is probed, or when a userspace app opens a device node or issues a magic ioctl (as in the case of CUDA on TK1). Another caveat is that the helper is run in the same namespace as current at the time the request is issued. Lastly, the mechanism can be completely disabled by the kernel configuration.
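
Which of these mechanisms a particular kernel was built with can usually be read from its build configuration. The snippet below is a sketch under a few assumptions: the function name is made up, the config is expected at /proc/config.gz or /boot/config-<release> (not every kernel exposes either), and the relevant options are assumed to share the CONFIG_FW_LOADER prefix (e.g. CONFIG_FW_LOADER_USER_HELPER in mainline; vendor kernels may differ).

```shell
#!/bin/sh
# Print the firmware-loader related options from a kernel config file,
# including ones that are explicitly "not set".
fw_loader_options() {
    config=$1
    case $config in
        *.gz) zcat "$config" ;;
        *)    cat "$config" ;;
    esac | grep -E '^(# )?CONFIG_FW_LOADER'
}

# Use the first readable config of the running kernel, if any.
for candidate in /proc/config.gz "/boot/config-$(uname -r)"; do
    if [ -r "$candidate" ]; then
        fw_loader_options "$candidate" ||
            echo "no CONFIG_FW_LOADER options found in $candidate"
        break
    fi
done
```

If the userspace helper option is reported as not set, only direct loading from the kernel is available and the helper-related caveat does not apply.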

I feel like coming up with a firmware backend requires further research and maybe some experimentation. For now, I’d say that the bind hack is a reasonable workaround. An alternative workaround would be to have some helper trigger the driver to load the firmware before a snap application is run.

jdstrand · November 2, 2018, 1:36pm

I wonder if it makes sense to adjust the kmod backend (or use a new firmware backend) that operates like kmod in that the backend will tell the system to make sure the devices are created and the firmware is in place. We don’t allow snaps to load modules and instead have snapd configure the system to load them on behalf of the snap. Similarly, we wouldn’t let snaps create devices or load firmware, but let snapd configure the system to do this on behalf of the snap. Unlike kmod, OTOH I’m not aware of a mechanism for devices and firmware that works like /etc/modules-load.d, but that doesn’t mean snapd can’t create one, with an early systemd unit for doing all of this before snaps start launching.

mborzecki · November 5, 2018, 8:26am

I think that’s part of the problem. You can load modules upfront, but firmware is loaded on demand. This may happen when a driver is loaded, which would work for us. However, in the problem discussed earlier in the topic, a specific (and legit) ioctl triggers the driver to request the firmware.

jdstrand · November 5, 2018, 2:56pm

Right-- would it make sense for said backend to also call the ioctl?

mborzecki · November 6, 2018, 9:05am

I think it’s too driver-specific. We could do a one-off for Jetson TK1 but that doesn’t mean it continues working for newer versions of the board.