Nvidia CUDA on Ubuntu Core


#41

I don’t have context for this. If applications need read-only access to files in /lib/firmware to work properly, we can add it. Ideally something like ‘/lib/firmware/nvidia/{,*} r,’ or ‘/lib/firmware/nv*.fw r,’. I don’t have nvidia hardware so don’t know the actual paths that should be used, but the idea is scope them to nvidia if possible. Happy to take a look in the PR.

All that said, I don’t think a snap should be allowed to actually load firmware. Reading files in /lib/firmware is one thing, but loading is another. I’m not sure otoh of the syscall/userspace mechanisms for this, but we should model a new firmware backend on the kmod backend for this if it’s needed so that snapd can load the firmware on the snap’s behalf. If the only reason the snap needs /lib/firmware is to load firmware it finds, then please consider a firmware backend and skip the snap-confine, snap-update-ns and apparmor changes.


#42

It is my understanding that what happens is as follows:

  • application does an ioctl to /dev/nvhost-as-gpu
  • the ioctl triggers an operation in the driver that attempts to switch some operating mode of the device
  • the driver does request_firmware()
  • using the old udev helper mechanism, the helper is run in the same namespace as current, thus /lib/firmware is not populated with the right firmware files, no firmware is loaded, and the device does not function as expected

It’s not uncommon for the driver to attempt loading the firmware on first access to the device. In that case, if the access happens within a snap, we run into this problem. Also, in the case above there was no apparmor support in the kernel, so we did not get a chance to see how the helper was labelled.

I think it boils down to the question of whether we should mount /lib/firmware from the host on classic systems.
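
To make the namespace problem easier to check, here is a minimal sketch that reports which firmware directories are not visible under a given root. It can be run on the host and again from inside the snap's mount namespace to compare the two views. The directory names are the ones reported for Jetson boards later in this thread; the function name and defaults are illustrative assumptions, not anything provided by snapd.

```python
import os
import tempfile

# Firmware directories reported for Jetson boards later in this thread.
JETSON_FIRMWARE_DIRS = ["tegra21x", "nvidia", "gm20b"]


def missing_firmware_dirs(firmware_root="/lib/firmware", names=JETSON_FIRMWARE_DIRS):
    """Return the entries of `names` that are not directories under firmware_root."""
    return [n for n in names if not os.path.isdir(os.path.join(firmware_root, n))]


if __name__ == "__main__":
    # Demo against a throwaway directory that only contains "nvidia".
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "nvidia"))
        print(missing_firmware_dirs(root))
```

An empty list on the host together with a non-empty list inside the snap would be consistent with the behaviour described above.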


#43

I saw that a month ago there were some changes in the opengl interface. Because of that, and because we recently migrated to Ubuntu 16.04 as a host system, I decided to give this another try, which resulted in the same problems described months ago when accessing the CUDA device.

Like before, if I execute the binding hack suggested by @mborzecki in this post before launching the snap, I can successfully run it under devmode confinement. We would like to install this snap on several devices from now on, but I have some concerns:

  • How bad a practice would it be to keep using the suggested binding hack before launching the snap? Can this affect the performance of snapd in any way?
  • @jdstrand, would adding access to /dev/nvmap and /dev/nvhost* help to get rid of this problem? Are more changes required? In any case, we are willing to help with debugging if necessary.

#44

Hey @mborzecki, @jdstrand, where does this stand? Do we have a way forward?


#45

I think we can extend the interface to allow accessing /dev/nvhost* and /dev/nvmap to get people started. There is also a PR extending the opengl interface with CUDA related pieces, see here for details: https://github.com/snapcore/snapd/pull/5696/
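
As a rough illustration of what that extension would cover, the sketch below renders AppArmor-style file rules for the two device patterns and checks which device node names they would match. The "rw" permission and the helper names are assumptions for illustration only, and Python's shell-style matching is only an approximation of AppArmor globbing (it is close enough for names without a slash).

```python
from fnmatch import fnmatchcase

# Device patterns proposed in this post, relative to /dev.
PROPOSED_PATTERNS = ["nvhost*", "nvmap"]


def render_apparmor_rules(patterns, perms="rw"):
    """Render one AppArmor file rule per /dev pattern (perms is an assumption)."""
    return [f"/dev/{p} {perms}," for p in patterns]


def covered(device_name, patterns=PROPOSED_PATTERNS):
    """True if a device node name matches any of the proposed patterns."""
    return any(fnmatchcase(device_name, p) for p in patterns)


if __name__ == "__main__":
    for rule in render_apparmor_rules(PROPOSED_PATTERNS):
        print(rule)
    for name in ["nvhost-as-gpu", "nvmap", "nvram"]:
        print(name, covered(name))
```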

As for the firmware, there were two mechanisms in use:

  • the old one, where there is a userspace helper for loading the firmware - this could possibly be mediated by a helper provided by snapd (IIRC there was another one using uevents, but I’m not sure at this point and this would need further research)
  • the new one, where the kernel attempts to load the firmware file directly, if that fails there’s a fallback path to the old mechanism (if not disabled by kernel config)

The caveat of the firmware mechanism is that it’s reactive, i.e. the action happens only when requested by the driver, and this could equally well be when the device is probed, when a userspace app opens a device node, or when it issues a magical ioctl (as in the case of CUDA on TK1). Another caveat is that the helper is run in the same namespace as current at the time the request is issued. Lastly, the mechanism can be completely disabled by the kernel configuration.
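
For reference, the direct loading path can be sketched as below. It follows the search order given in the kernel's firmware documentation: an optional custom path set through the firmware_class.path parameter, then /lib/firmware/updates/<release>, /lib/firmware/updates, /lib/firmware/<release> and /lib/firmware. The function names, the `root` parameter and the file name "example.bin" are illustrative assumptions; `root` lets the lookup be pointed at whatever filesystem view is being examined.

```python
import os
import tempfile


def firmware_search_paths(kernel_release, custom_path="", root="/"):
    """Directories tried by the kernel's direct firmware loader, in order."""
    candidates = [
        custom_path,  # firmware_class.path= value; empty means not set
        os.path.join("lib/firmware/updates", kernel_release),
        "lib/firmware/updates",
        os.path.join("lib/firmware", kernel_release),
        "lib/firmware",
    ]
    return [os.path.join(root, c.lstrip("/")) for c in candidates if c]


def resolve_firmware(name, search_paths):
    """Return the first existing file for `name`, or None if no directory has it."""
    for directory in search_paths:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        paths = firmware_search_paths("4.9.0", root=root)
        print(resolve_firmware("example.bin", paths))  # nothing populated yet
```

When the lookup returns nothing in every directory, the request either fails or falls back to the userspace helper, depending on the kernel configuration.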

I feel like coming up with a firmware backend requires further research and maybe some experimentation. For now, I’d say that the bind hack is a reasonable workaround. An alternative workaround would be to have some helper trigger the driver to load the firmware before a snap application is run.


#46

I wonder if it makes sense to adjust the kmod backend (or use a new firmware backend) that operates like kmod in that the backend will tell the system to make sure the devices are created and the firmware is in place. We don’t allow snaps to load modules and instead have snapd configure the system to load them on behalf of the snap. Similarly, we wouldn’t let snaps create devices or load firmware, but let snapd configure the system to do this on behalf of the snap. Unlike kmod, OTOH I’m not aware of a mechanism for devices and firmware that works like /etc/modules-load.d, but that doesn’t mean snapd can’t create one, with an early systemd unit for doing all of this before snaps start launching.
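
Purely as a thought experiment, such an early unit could be generated along these lines. Nothing here exists in snapd: the helper path, the unit description and the choice of ordering before snapd.service are all hypothetical, and whether that ordering is early enough for every snap is an open question.

```python
def render_oneshot_unit(exec_start, before="snapd.service"):
    """Render a hypothetical oneshot unit that runs a preparation helper early."""
    lines = [
        "[Unit]",
        "Description=Prepare devices and firmware on behalf of snaps (hypothetical)",
        f"Before={before}",
        "",
        "[Service]",
        "Type=oneshot",
        "RemainAfterExit=yes",
        f"ExecStart={exec_start}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The helper path is a placeholder, not a real snapd binary.
    print(render_oneshot_unit("/usr/local/bin/example-firmware-helper"))
```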


#47

I think that’s part of the problem. You can load modules upfront, but firmware is loaded on demand. This may happen when a driver is loaded, which would work for us. However, in the problem discussed earlier in the topic, a specific (and legit) ioctl triggers the driver to request the firmware.


#48

Right-- would it make sense for said backend to also call the ioctl?


#49

I think it’s too driver specific. We could do a one-off for Jetson TK1 but that doesn’t mean it continues working for newer versions of the board.


#50

Hey

I’m working on CUDA support for both core and classic. On classic it should support the regular GPUs. On core it will surely support the NVIDIA Jetson Nano, perhaps more as the initial support lands.

More details are in this post: Hello World CUDA Analysis and GPU Support Proposal


#51

Hey @zyga, what’s your latest progress on this? I’ve got specific interest in using this on a Jetson Nano using the NVIDIA 4.9 kernel (I’ve heard it fully runs snapd just fine).


#52

This work is still scheduled and in the roadmap but somewhat behind several other Nvidia tasks. The plan for now is to work on a few bugfixes for 2.40 and 2.41 and then switch gears to nvidia / opengl work. I will start from the desktop and by August switch to the Jetson Nano. All this assuming that core20 work that is currently being designed will not change those plans.


#53

Hi guys, I would like to offer some information regarding support for the Nvidia Jetson products within snap confinement. This is not specific to Ubuntu Core.

The hardware I’m using is an Nvidia Jetson running Linux for Tegra (Ubuntu 18.04) with kernel version 4.9. However, all recent Linux for Tegra images have a very similar file and device structure, for example those for the Jetson TX1 and Jetson TX2.

Unfortunately, the opengl libraries, system devices and firmware are not in the locations where they would normally be on an Ubuntu 18.04 system running an nvidia GPU.

If anyone has a quick fix that lets me enable opengl support in snaps, that would be awesome.

This is a breakdown of the important file paths required for opengl support / cuda…

OpenGL libraries are in the following locations:

/usr/lib/aarch64-linux-gnu/tegra/
ld.so.conf
libcuda.so
libcuda.so.1
libcuda.so.1.1
libdrm.so.2
libGLX_nvidia.so.0
libnvapputil.so
libnvargus.so
libnvargus_socketclient.so
libnvargus_socketserver.so
libnvavp.so
libnvbuf_utils.so
libnvbuf_utils.so.1.0.0
libnvcameratools.so
libnvcamerautils.so
libnvcam_imageencoder.so
libnvcamlog.so
libnvcamv4l2.so
libnvcolorutil.so
libnvdc.so
libnvddk_2d_v2.so
libnvddk_vic.so
libnveglstream_camconsumer.so
libnveglstreamproducer.so
libnveventlib.so
libnvexif.so
libnvfnet.so
libnvfnetstoredefog.so
libnvfnetstorehdfx.so
libnvgov_boot.so
libnvgov_camera.so
libnvgov_force.so
libnvgov_generic.so
libnvgov_gpucompute.so
libnvgov_graphics.so
libnvgov_il.so
libnvgov_spincircle.so
libnvgov_tbc.so
libnvgov_ui.so
libnvidia-eglcore.so.32.1.0
libnvidia-egl-wayland.so
libnvidia-egl-wayland.so.1
libnvidia-fatbinaryloader.so.32.1.0
libnvidia-glcore.so.32.1.0
libnvidia-glsi.so.32.1.0
libnvidia-glvkspirv.so.32.1.0
libnvidia-ptxjitcompiler.so.1
libnvidia-ptxjitcompiler.so.32.1.0
libnvidia-rmapi-tegra.so.32.1.0
libnvidia-tls.so.32.1.0
libnvid_mapper.so
libnvid_mapper.so.1.0.0
libnvimp.so
libnvjpeg.so
libnvll.so
libnvmedia.so
libnvmm_contentpipe.so
libnvmmlite_image.so
libnvmmlite.so
libnvmmlite_utils.so
libnvmmlite_video.so
libnvmm_parser.so
libnvmm.so
libnvmm_utils.so
libnvodm_imager.so
libnvomxilclient.so
libnvomx.so
libnvosd.so
libnvos.so
libnvparser.so
libnvphsd.so
libnvphs.so
libnvrm_gpu.so
libnvrm_graphics.so
libnvrm.so
libnvscf.so
libnvtestresults.so
libnvtnr.so
libnvtracebuf.so
libnvtvmr.so
libnvtx_helper.so
libnvwinsys.so
libsensors.hal-client.nvs.so
libsensors_hal.nvs.so
libsensors.l4t.no_fusion.nvs.so
libtegrav4l2.so

/usr/lib/aarch64-linux-gnu/tegra-egl
ld.so.conf
libEGL_nvidia.so.0
libGLESv1_CM_nvidia.so.1
libGLESv2_nvidia.so.2
nvidia.json

The relevant GPU devices are as follows:

graphics devices

KERNEL=="l3cache" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="nvmap" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="nvram" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="nvhdcp*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="nvhost*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="nvhost-dbg-gpu" OWNER="root" GROUP="root" MODE="0666"
KERNEL=="nvhost-prof-gpu" OWNER="root" GROUP="root" MODE="0666"
KERNEL=="tegra*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="ion" OWNER="root" GROUP="video" MODE="0666"

camera and related devices

KERNEL=="torch" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="ov*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="focuser*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="camera*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="imx*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="sh5*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="tps*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="mipi-cal" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="ar*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="camchar*" OWNER="root" GROUP="video" MODE="0666"
KERNEL=="capture-*" OWNER="root" GROUP="video" MODE="0666"

root only devices

KERNEL=="knvrm" OWNER="root" GROUP="root" MODE="0666"
KERNEL=="knvmap" OWNER="root" GROUP="root" MODE="0666"
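
In case it helps whoever works on the interface, here is a small sketch for turning rules like the ones above into something that can be queried by device name. It assumes udev's usual behaviour that every matching rule is applied in order and a later "=" assignment overrides an earlier one; the parser only understands the simple KEY=="value" / KEY="value" form used in this list, and the function names are my own.

```python
import re
from fnmatch import fnmatchcase

# Matches KEY=="value" (match key) and KEY="value" (assignment key).
RULE_KEY = re.compile(r'(\w+)\s*(==|=)\s*"([^"]*)"')


def parse_rule(line):
    """Parse one simple udev rule line into a {key: value} dict."""
    # Normalise typographic quotes that forum software may have substituted.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    return {key: value for key, _op, value in RULE_KEY.findall(line)}


def effective_settings(device_name, rules):
    """Merge OWNER/GROUP/MODE from every rule whose KERNEL pattern matches."""
    settings = {}
    for rule in rules:
        if fnmatchcase(device_name, rule.get("KERNEL", "")):
            settings.update({k: v for k, v in rule.items() if k != "KERNEL"})
    return settings


if __name__ == "__main__":
    rules = [
        parse_rule('KERNEL=="nvhost*" OWNER="root" GROUP="video" MODE="0666"'),
        parse_rule('KERNEL=="nvhost-dbg-gpu" OWNER="root" GROUP="root" MODE="0666"'),
    ]
    print(effective_settings("nvhost-dbg-gpu", rules))
```

Note that nvhost-dbg-gpu also matches the broader nvhost* pattern, so the more specific rule only takes effect because it comes later.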

Firmware directories:
/lib/firmware/tegra21x/
/lib/firmware/nvidia
/lib/firmware/gm20b