Nvidia CUDA on Ubuntu Core

Unfortunately I don’t have any CUDA-capable hardware, so no :’(

I made some preparations, so if I can get any nvidia hardware (something that uses recent drivers is best) I can give it a try. I now have an Intel motherboard with some RAM and a disk on my desk that I can install various systems on.

It was brought to my attention that AWS provides instances with access to CUDA-capable GPUs. Perhaps that would be a good testing ground for this? Google cloud, as well.

I doubt that, because 1) you get a random kernel and 2) it’s extremely expensive compared to a 50-100 card that I can use forever. I think we are at a stage where we need to tinker and experiment more. Those AWS instances only give you CUDA if you use their specific kernel.

Ah, that would indeed be problematic. Well, I look forward to hearing about your tinkering, then! Thanks for looking into this :slight_smile: .

This is still required. Any progress on this front?


We have a few blender snaps around. Blender can utilize CUDA, so it might provide a good test snap for enablement. blender-tpaw is strictly confined, and I can’t access CUDA. blender is classically confined, and I can access CUDA (yes, in the past year I got some CUDA hardware).

Hmmm, there was some CUDA work done lately (Problem with confined nvenc / cuda ffmpeg snap, https://github.com/snapcore/snapd/pull/5189). Did you test with 2.33 or 2.34? What specifically do you mean by “can’t access CUDA”?

@jdstrand you’re magic! blender-tpaw actually does seem to work with CUDA on 2.33 (refreshed to beta core snap). I didn’t see that other thread, good catch.

Also, “works” being defined as “shows up as an option here”:



Hi @jdstrand. I refreshed the core snap as @kyrofa suggested in this thread, but I couldn’t get CUDA running when using devmode confinement.

Running the same program outside of a snap, or in a classically confined one, I get this:

OpenCV version:
CUDA runtime version: 6050
CUDA driver API version: 6050
CUDA devices: 1

However, when in devmode this is the output:

OpenCV version:
CUDA runtime version: 0
CUDA driver API version: 0
CUDA devices: -1
OpenCV Error: Gpu API call (CUDA driver version is insufficient for CUDA runtime version) in getDevice, file /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp, line 664
terminate called after throwing an instance of 'cv::Exception'
  what():  /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp:664: error: (-217) CUDA driver version is insufficient for CUDA runtime version in function getDevice

I’m not sure if it is necessary, but I also included (and connected) some plugs to interfaces like opengl, hardware-observe, system-observe… as well as others not related to this. Am I missing something needed to enable CUDA support?
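If it helps to double-check, the interface connections can be listed and made from the command line; a minimal sketch, assuming snapd is available and using a placeholder snap name:

```shell
# Hypothetical snap name, used for illustration only.
SNAP_NAME="my-cuda-app"

if command -v snap >/dev/null 2>&1; then
    # Show which plugs of the snap are connected; opengl is the one
    # that mediates access to the GPU device nodes.
    snap connections "$SNAP_NAME"
    # Connect the opengl plug manually if it was not auto-connected.
    sudo snap connect "$SNAP_NAME:opengl"
else
    echo "snap not available on this machine"
fi
```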

This looks like there is a mismatch of some sort between your system and the snap. I’ve not worked with CUDA much myself, but I understand that the libraries and the kernel driver must be in sync. IIRC, @mborzecki looked at this(?) once before, perhaps he has some additional information.

Which version of the CUDA runtime are you using? Is it included in the snap?
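As a side note on reading those numbers: cudaRuntimeGetVersion() and cudaDriverGetVersion() report the version as an integer encoded as 1000*major + 10*minor, so the 6050 shown earlier is CUDA 6.5, while 0 means the library could not find a usable runtime at all. A quick sketch of the decoding:

```shell
# Decode a CUDA version integer (1000*major + 10*minor) into "major.minor".
decode_cuda_version() {
    local v=$1
    echo "$(( v / 1000 )).$(( (v % 1000) / 10 ))"
}

decode_cuda_version 6050   # prints 6.5
```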

Hi @mborzecki, thanks for pointing that out. It seems that the CUDA packages weren’t being included correctly. Now that I’ve fixed that, at least the driver API version is shown correctly; however, it seems there’s still something missing at runtime.

OpenCV version:
CUDA runtime version: 0
CUDA driver API version: 6050
OpenCV Error: Gpu API call (unknown error) in getCudaEnabledDeviceCount, file /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp, line 652
terminate called after throwing an instance of 'cv::Exception'
  what():  /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp:652: error: (-217) unknown error in function getCudaEnabledDeviceCount

I’m setting the environment as follows:

    LD_LIBRARY_PATH: $SNAP/usr/local/cuda-6.5/lib:$SNAP/usr/lib/arm-linux-gnueabihf/:$LD_LIBRARY_PATH
    PATH: $SNAP/usr/local/cuda-6.5/bin:$PATH
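A quick sanity check that the runtime libraries really ended up under that path; a sketch assuming the CUDA 6.5 layout above, with $SNAP falling back to the current directory when unset:

```shell
# Verify the CUDA runtime libraries were actually staged into the snap.
CUDA_LIB_DIR="${SNAP:-.}/usr/local/cuda-6.5/lib"

if [ -d "$CUDA_LIB_DIR" ]; then
    find "$CUDA_LIB_DIR" -name 'libcudart*'
else
    echo "no CUDA lib dir at $CUDA_LIB_DIR"
fi
```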

I’ve been taking a closer look at the opengl interface and I think I might have found the problem.

In order to allow access to nvidia devices, this interface grants r/w access to the /dev/nvidia* device nodes. That’s right for desktop systems; however, I am running the snap on a TK1 board, where those devices don’t exist: they are instead exposed as /dev/nvhost* and /dev/nvmap. Could you add these to the interface?
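For comparison, the relevant device nodes on each kind of system can be listed like this (a sketch; a desktop exposes /dev/nvidia*, while Tegra boards such as the TK1 expose /dev/nvhost* and /dev/nvmap instead):

```shell
# List NVIDIA-related device nodes, whichever naming scheme the system uses.
nodes=$(ls /dev/nvidia* /dev/nvhost* /dev/nvmap 2>/dev/null || true)

if [ -n "$nodes" ]; then
    echo "$nodes"
else
    echo "no NVIDIA device nodes found"
fi
```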


Can you paste the log of AppArmor denials? Does installing the snap in devmode allow it to run?

It doesn’t seem so, I already installed it in devmode.

Regarding AppArmor, I tried this the way I used to, but got nothing:

sudo snap install snappy-debug
sudo snap connect snappy-debug:log-observe
/snap/bin/snappy-debug.security scanlog

It’s been a while since I’ve done this kind of debugging, so I’m probably missing something… I’ll try to get you an error log next Monday!

Hi @mborzecki, I found out why I wasn’t getting anything from snappy-debug, nor from sudo grep audit /var/log/syslog. As I pointed out in the other post, I’m running L4T with a customized kernel, and it seems that kernel doesn’t include the AppArmor module by default. This is what I was getting:

Jun 18 11:28:49 UBUNTU-TK1 snapd[530]: AppArmor status: apparmor not enabled

I recompiled the kernel with the AppArmor options enabled. The situation improved, but not as much as I expected:

Jun 18 14:22:19 UBUNTU-TK1 snapd[478]: AppArmor status: apparmor is enabled but some features are missing: caps, dbus, mount, namespaces, network, ptrace, signal
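For what it’s worth, the feature set snapd is probing can be inspected directly through securityfs; a sketch that only yields results on kernels with AppArmor compiled in and securityfs mounted:

```shell
# snapd checks /sys/kernel/security/apparmor/features for the mediation
# classes (caps, dbus, mount, namespaces, network, ptrace, signal, ...)
# supported by the running kernel's AppArmor.
FEATURES_DIR=/sys/kernel/security/apparmor/features

if [ -d "$FEATURES_DIR" ]; then
    ls "$FEATURES_DIR"
else
    echo "AppArmor features directory not present on this kernel"
fi
```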

I tried to get something useful by checking /var/log/syslog directly, and although it is now a little more verbose, I couldn’t get anything beyond information at system start-up (it doesn’t show more information regardless of how many times I launch the different snaps).

Jun 18 14:22:31 UBUNTU-TK1 kernel: type=1400 audit(1529299351.256:110): apparmor="STATUS" operation="profile_replace" name="snap-update-ns.core" pid=1348 comm="apparmor_parser"
Jun 18 14:22:31 UBUNTU-TK1 kernel: type=1400 audit(1529299351.261:112): apparmor="STATUS" operation="profile_replace" name="snap-update-ns.hello-world" pid=1350 comm="apparmor_parser"
Jun 18 14:22:31 UBUNTU-TK1 kernel: type=1400 audit(1529299351.262:113): apparmor="STATUS" operation="profile_replace" name="snap-update-ns.snapcraft" pid=1351 comm="apparmor_parser"

I did a quick search on the missing features, and it doesn’t leave me very optimistic, as most of the solutions involve upgrading the kernel, which I cannot do as I would lose support for some devices.

Regarding the original issue, I found out that if I launch the standalone version first and then the snap, the CUDA runtime API is correctly detected from the snap. On the other hand, if I launch the snap first and then the standalone version, neither of them detects the runtime API correctly.

I don’t know if this gives you any hint; I just wanted to let you know, as I hadn’t noticed this behaviour until now.

I wouldn’t bother enabling AppArmor if it’s not there.

Can you try and run the application under strace? snap run --strace <snap>.<app>.

Another thing that might be worth checking is whether any new /dev nodes appear or new drivers are loaded once the standalone instance has run.

Regardless of whether the snap launched correctly or not, snap run --strace <snap>.<app> gives an endless output.

Regarding the /dev nodes and loaded drivers (lsmod?), I counted the number of entries with wc -l, but I couldn’t see any difference between before the standalone launch and after it.
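Counting entries with wc -l can miss a change where one node appears while another disappears; diffing sorted name lists is a more reliable version of the same check (a sketch; the standalone launch is left as a placeholder comment):

```shell
# Snapshot device nodes and loaded modules before the standalone run...
ls -1 /dev | sort > /tmp/dev-before.txt
command -v lsmod >/dev/null 2>&1 && lsmod | sort > /tmp/mod-before.txt || true

# ...launch the standalone CUDA program here...

# ...then snapshot again and compare by name rather than by count.
ls -1 /dev | sort > /tmp/dev-after.txt
diff /tmp/dev-before.txt /tmp/dev-after.txt && echo "no new /dev nodes"
```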

Are you saying that CUDA works just fine when running under strace? If the application forks a worker or some such, you may need to pass -f to strace, e.g. snap run --strace='-vf' <snap>.<app>
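One way to tame the endless output mentioned above is to follow forks and filter the trace down to file-open syscalls, which is where a missing device node or library usually shows up; a sketch with placeholder snap and app names:

```shell
# "-vf" follows forked children; "-e trace=openat,open" keeps only the
# open calls, so lines touching the GPU device nodes stand out.
# my-snap.my-app is a placeholder for the real command.
if command -v snap >/dev/null 2>&1; then
    snap run --strace='-vf -e trace=openat,open' my-snap.my-app 2>&1 \
        | grep -E '/dev/(nvidia|nvhost|nvmap)' || true
else
    echo "snap not available; the plain equivalent is: strace -vf -e trace=openat <program>"
fi
```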