Nvidia CUDA on Ubuntu Core

It was brought to my attention that AWS provides instances with access to CUDA-capable GPUs. Perhaps that would be a good testing ground for this? Google Cloud, as well.

I doubt that, because 1) they run a random kernel and 2) they're extremely expensive compared to a $50–100 card that I can use forever. I think we are at a stage where we need to tinker and experiment more. Those AWS instances only give you CUDA if you use their specific kernel.

Ah, that would indeed be problematic. Well, I look forward to hearing about your tinkering, then! Thanks for looking into this 🙂

This is still required. Any progress on this front?


We have a few Blender snaps around. Blender can utilize CUDA, so it might provide a good test snap for enablement. blender-tpaw is strictly confined, and I can't access CUDA from it. blender is classically confined, and I can access CUDA from it (yes, in the past year I got some CUDA hardware).

Hmmm, there was some CUDA work done lately (Problem with confined nvenc / cuda ffmpeg snap, https://github.com/snapcore/snapd/pull/5189). Did you test with 2.33 or 2.34? What specifically do you mean by “can’t access CUDA”?

@jdstrand you’re magic! blender-tpaw actually does seem to work with CUDA on 2.33 (refreshed to beta core snap). I didn’t see that other thread, good catch.

Also, “works” being defined as “shows up as an option here”:

[Screenshot from 2018-06-08 11-54-35]


Hi @jdstrand. I refreshed the core snap as @kyrofa suggested in this thread, but I couldn't get CUDA running when using devmode confinement.

Running the same program outside the snap, or with classic confinement, I get this:

OpenCV version: 2.4.10.1
CUDA runtime version: 6050
CUDA driver API version: 6050
CUDA devices: 1

However, when in devmode this is the output:

OpenCV version: 2.4.10.1
CUDA runtime version: 0
CUDA driver API version: 0
CUDA devices: -1
OpenCV Error: Gpu API call (CUDA driver version is insufficient for CUDA runtime version) in getDevice, file /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp, line 664
terminate called after throwing an instance of 'cv::Exception'
  what():  /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp:664: error: (-217) CUDA driver version is insufficient for CUDA runtime version in function getDevice

I’m not sure if it is necessary, but I also included (and connected) plugs for interfaces like opengl, hardware-observe, system-observe… as well as others not related to this. Am I missing something needed to enable CUDA support?
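
In case it helps, this is roughly how I've been checking and connecting them from the command line (the snap name below is just a placeholder for mine):

# List the interfaces (plugs/slots) of the snap and their connection state
snap interfaces my-snap

# Manually connect the plugs that were not auto-connected
sudo snap connect my-snap:opengl
sudo snap connect my-snap:hardware-observe
sudo snap connect my-snap:system-observe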

This looks like there is a mismatch of some sort between your system and the snap. I’ve not worked with CUDA much myself, but I understand that the libraries and the kernel driver must be in sync. IIRC, @mborzecki looked at this(?) once before; perhaps he has some additional information.

Which version of the CUDA runtime do you use? Is it included in the snap?

Hi @mborzecki, thanks for pointing that out. It turns out the CUDA packages weren’t being included correctly. Now that I’ve solved that, at least the driver API version is reported correctly; however, it seems there’s still something missing at runtime.

OpenCV version: 2.4.10.1
CUDA runtime version: 0
CUDA driver API version: 6050
OpenCV Error: Gpu API call (unknown error) in getCudaEnabledDeviceCount, file /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp, line 652
terminate called after throwing an instance of 'cv::Exception'
  what():  /hdd/buildbot/slave_jetson_tk1_2/52-O4T-L4T/opencv/modules/dynamicuda/include/opencv2/dynamicuda/dynamicuda.hpp:652: error: (-217) unknown error in function getCudaEnabledDeviceCount

I’m setting the environment as follows:

environment:
    LD_LIBRARY_PATH: $SNAP/usr/local/cuda-6.5/lib:$SNAP/usr/lib/arm-linux-gnueabihf/:$LD_LIBRARY_PATH
    PATH: $SNAP/usr/local/cuda-6.5/bin:$PATH
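
And, to rule out further packaging problems, this is more or less how I verify that the CUDA libraries actually end up inside the snap and are visible at runtime (the snap, app and binary names below are placeholders for mine):

# Check that the CUDA libraries were staged into the installed snap
ls /snap/my-snap/current/usr/local/cuda-6.5/lib

# Open a shell inside the snap's runtime environment
# (my-snap.my-app is a placeholder)
snap run --shell my-snap.my-app

# ...and, from that shell, check the library path and the binary's CUDA dependencies
echo $LD_LIBRARY_PATH
ldd $SNAP/usr/bin/my-app | grep -i cuda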

I’ve been taking a closer look at the opengl interface and I think I might have found the problem.

In order to allow access to nvidia devices, this interface grants read/write access to the /dev/nvidia* device nodes. That’s fine for desktop systems; however, I am running the snap on a TK1 board. On this board, those devices don’t exist: the GPU is exposed through /dev/nvhost* and /dev/nvmap instead. Could you add these to the interface?
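
For what it’s worth, this is how it looks on the board, and how I checked which device nodes the AppArmor profile generated for the snap actually mentions (the profile name depends on the snap and app, so the one below is just an example):

# On the TK1 the GPU is exposed through these nodes instead of /dev/nvidia*
ls -l /dev/nvhost* /dev/nvmap

# Check which device nodes the profile generated by snapd refers to
sudo grep -iE 'nvidia|nvhost|nvmap' /var/lib/snapd/apparmor/profiles/snap.my-snap.my-app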


Can you paste the log of AppArmor denials? Does installing the snap in devmode allow it to run?

It doesn’t seem so; I had already installed it in devmode.

Regarding AppArmor, I tried this the way I usually do, but got nothing:

sudo snap install snappy-debug
sudo snap connect snappy-debug:log-observe
/snap/bin/snappy-debug.security scanlog

It’s been a while since I’ve done this kind of debugging, so I’m probably missing something… I’ll try to give you an error log next Monday!

Hi @mborzecki, I found out why I wasn’t getting anything from snappy-debug, nor from sudo grep audit /var/log/syslog. As I pointed out in the other post, I’m running L4T with a customized kernel, and it seems that kernel doesn’t include the AppArmor module by default. This is what I was getting:

Jun 18 11:28:49 UBUNTU-TK1 snapd[530]: AppArmor status: apparmor not enabled

I recompiled the kernel, adding the AppArmor options. The situation improved, but not as much as I expected:

Jun 18 14:22:19 UBUNTU-TK1 snapd[478]: AppArmor status: apparmor is enabled but some features are missing: caps, dbus, mount, namespaces, network, ptrace, signal

I tried to get something useful by checking /var/log/syslog directly, and although it is now a little more verbose, I couldn’t get anything beyond information at system start-up (it doesn’t show more information regardless of how many times I launch the different snaps).

Jun 18 14:22:31 UBUNTU-TK1 kernel: type=1400 audit(1529299351.256:110): apparmor="STATUS" operation="profile_replace" name="snap-update-ns.core" pid=1348 comm="apparmor_parser"
Jun 18 14:22:31 UBUNTU-TK1 kernel: type=1400 audit(1529299351.261:112): apparmor="STATUS" operation="profile_replace" name="snap-update-ns.hello-world" pid=1350 comm="apparmor_parser"
Jun 18 14:22:31 UBUNTU-TK1 kernel: type=1400 audit(1529299351.262:113): apparmor="STATUS" operation="profile_replace" name="snap-update-ns.snapcraft" pid=1351 comm="apparmor_parser"

I did a quick search about the missing features, and it doesn’t leave me very optimistic, as most of the solutions involve upgrading the kernel, which I cannot do because I would lose support for some devices.
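
For completeness, this is what I’ve been using to see which AppArmor features the running kernel actually exposes (the second command only works if the kernel was built with /proc/config.gz support, which may not be the case on every L4T build):

# Features advertised by the running kernel's AppArmor implementation
ls /sys/kernel/security/apparmor/features/

# AppArmor-related options the kernel was compiled with
zcat /proc/config.gz | grep -i apparmor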


Regarding the original issue, I found out that if I launch the standalone version first and the snap afterwards, the CUDA runtime API is correctly detected from the snap. On the other hand, if I launch the snap first and then the standalone version, neither of them detects the runtime API correctly.

I don’t know if this will give you any hints; I just wanted to let you know, as I hadn’t noticed this behaviour until now.

I wouldn’t bother enabling AppArmor if it’s not there.

Can you try and run the application under strace? snap run --strace <snap>.<app>.

Another thing that might be worth checking is whether, once the standalone instance has run, any new /dev nodes appear or new drivers are loaded.
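
Something along these lines should be enough to spot any difference (the file names are arbitrary):

# Snapshot the device nodes and loaded modules before the standalone run
ls /dev > /tmp/dev-before.txt
lsmod > /tmp/lsmod-before.txt

# ... run the standalone application ...

# Snapshot again afterwards and compare
ls /dev > /tmp/dev-after.txt
lsmod > /tmp/lsmod-after.txt
diff /tmp/dev-before.txt /tmp/dev-after.txt
diff /tmp/lsmod-before.txt /tmp/lsmod-after.txt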

Regardless of whether the snap launched correctly or not, snap run --strace <snap>.<app> gives an endless output.

Regarding the /dev nodes and loaded drivers (lsmod?), I counted the number of entries with wc -l, but I couldn’t see any difference between before and after the standalone launch.

Are you saying that CUDA works just fine when running under strace? If the application forks a worker or some such, you may need to pass -f to strace, e.g. snap run --strace='-vf' <snap>.<app>

I meant that when launching the snap under strace, it kept producing output even after the program using the CUDA driver crashed. This probably happens because the launch of the program needing CUDA is wrapped in a roslaunch, which automatically starts a ROS core that survives even if the main application crashes; hence the endless strace output.

Tomorrow I’ll give it a try with the other arguments, thanks!

I managed to get useful output using strace. I basically did two tests, capturing the trace in each case as sketched below:

  • System start-up -> run the standalone application -> run the snap (this works, and so do subsequent calls)
  • System start-up -> run the snap directly (this fails, and subsequent calls to either the standalone app or the snap also fail)
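
In both cases I captured the trace to a file so I could compare the two runs, roughly like this (the snap/app names and file paths are placeholders for mine):

# Trace the snapped application, following forks, and keep the output
# (strace writes to stderr, so the app's own stderr ends up mixed in)
snap run --strace='-vf' my-snap.my-app 2> /tmp/snap-strace.log

# Then look for GPU-related activity in the trace
grep -iE 'nvidia|nvhost|nvmap|cuda' /tmp/snap-strace.log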

I examined the logs looking for any nvidia/cuda related information. At the beginning, both contain the same entries:

PID  access("/sys/module/nvidia/version", F_OK) = -1 ENOENT (No such file or directory)
...
PID  futex(0xb659a1d4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
PID  open("/dev/shm/cuda_injection_path_shm", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory)
PID  open("/home/ubuntu/snap/my-snap/x1/.nv/nvidia-application-profile-globals-rc", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  open("/home/ubuntu/snap/my-snap/x1/.nv/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  open("/home/ubuntu/snap/my-snap/x1/.nv/nvidia-application-profiles-rc.d", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  open("/etc/nvidia/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  open("/etc/nvidia/nvidia-application-profiles-rc.d/", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  open("/usr/share/nvidia/nvidia-application-profiles-21.4-rc", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  open("/usr/share/nvidia/nvidia-application-profiles-rc", O_RDONLY) = -1 ENOENT (No such file or directory)
PID  geteuid32()                       = 1000
PID  open("/tmp/nvidia-mps/control", O_WRONLY|O_NONBLOCK) = -1 ENOENT (No such file or directory)

However, later there’s a slight difference when accessing the board:


Running the snap after executing the standalone application

5890  stat64("/usr/bin/nvidia-modprobe", 0xbeb00e80) = -1 ENOENT (No such file or directory)
5890  open("/proc/driver/nvidia/params", O_RDONLY) = -1 ENOENT (No such file or directory)
5890  stat64("/dev/nvidiactl", 0xbeb00d48) = -1 ENOENT (No such file or directory)
5890  mknod("/dev/nvidiactl", S_IFCHR|0666, makedev(195, 255)) = -1 EACCES (Permission denied)
5890  geteuid32()                       = 1000
5890  stat64("/usr/bin/nvidia-modprobe", 0xbeb00e80) = -1 ENOENT (No such file or directory)
5890  open("/dev/nvidiactl", O_RDWR|O_LARGEFILE) = -1 ENOENT (No such file or directory)
5890  open("/dev/nvhost-as-gpu", O_RDWR) = 11
5890  close(11)                         = 0
5890  open("/dev/nvmap", O_RDWR|O_DSYNC|O_CLOEXEC) = 11
5890  open("/dev/nvhost-as-gpu", O_RDWR|O_DSYNC) = 12
5890  ioctl(12, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18), 0xbeb00fc0) = 0
5890  open("/dev/nvhost-prof-gpu", O_RDWR) = 13
5890  open("/sys/devices/platform/host1x/gk20a.0/ptimer_scale_factor", O_RDONLY) = 14
5890  fstat64(14, {st_dev=makedev(0, 11), st_ino=7356, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=4096, st_atime=1529388093 /* 2018-06-19T15:01:33.222825993+0900 */, st_atime_nsec=222825993, st_mtime=1529388093 /* 2018-06-19T15:01:33.222825993+0900 */, st_mtime_nsec=222825993, st_ctime=1529388093 /* 2018-06-19T15:01:33.222825993+0900 */, st_ctime_nsec=222825993}) = 0
5890  read(14, "2.604166\n", 4096)      = 9
5890  close(14)                         = 0
5890  sysinfo({uptime=241, loads=[28256, 19104, 8192], totalram=2034458624, freeram=1112154112, sharedram=0, bufferram=80490496, totalswap=0, freeswap=0, procs=186, totalhigh=1271922688, freehigh=485691392, mem_unit=1}) = 0
5890  ioctl(12, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18), 0xbeb01030) = 0
5890  sysinfo({uptime=241, loads=[28256, 19104, 8192], totalram=2034458624, freeram=1112154112, sharedram=0, bufferram=80490496, totalswap=0, freeswap=0, procs=186, totalhigh=1271922688, freehigh=485691392, mem_unit=1}) = 0
5890  sysinfo({uptime=241, loads=[28256, 19104, 8192], totalram=2034458624, freeram=1112154112, sharedram=0, bufferram=80490496, totalswap=0, freeswap=0, procs=186, totalhigh=1271922688, freehigh=485691392, mem_unit=1}) = 0
5890  sysinfo({uptime=241, loads=[28256, 19104, 8192], totalram=2034458624, freeram=1112154112, sharedram=0, bufferram=80490496, totalswap=0, freeswap=0, procs=186, totalhigh=1271922688, freehigh=485691392, mem_unit=1}) = 0
5890  sysinfo({uptime=241, loads=[28256, 19104, 8192], totalram=2034458624, freeram=1112154112, sharedram=0, bufferram=80490496, totalswap=0, freeswap=0, procs=186, totalhigh=1271922688, freehigh=485691392, mem_unit=1}) = 0
5890  open("/sys/kernel/tegra_gpu/gpu_available_rates", O_RDONLY) = 14

From this point on, there are a few more accesses to /sys/kernel/tegra_gpu. Those accesses are not seen in the next test.


Running the snap directly

1957  stat64("/usr/bin/nvidia-modprobe", 0xbee73e80) = -1 ENOENT (No such file or directory)
1957  open("/proc/driver/nvidia/params", O_RDONLY) = -1 ENOENT (No such file or directory)
1957  stat64("/dev/nvidiactl", 0xbee73d48) = -1 ENOENT (No such file or directory)
1957  mknod("/dev/nvidiactl", S_IFCHR|0666, makedev(195, 255)) = -1 EACCES (Permission denied)
1957  geteuid32()                       = 1000
1957  stat64("/usr/bin/nvidia-modprobe", 0xbee73e80) = -1 ENOENT (No such file or directory)
1957  open("/dev/nvidiactl", O_RDWR|O_LARGEFILE) = -1 ENOENT (No such file or directory)
1957  open("/dev/nvhost-as-gpu", O_RDWR) = 10
1957  close(10)                         = 0
1957  open("/dev/nvmap", O_RDWR|O_DSYNC|O_CLOEXEC) = 10
1957  open("/dev/nvhost-as-gpu", O_RDWR|O_DSYNC) = 12
1957  ioctl(12, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18), 0xbee73fc0) = -1 ENOENT (No such file or directory)
1957  ioctl(12, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x3, 0x10), 0xbee73f28) = -1 EINVAL (Invalid argument)
1957  close(10)                         = 0
1957  close(-1)                         = -1 EBADF (Bad file descriptor)
1957  close(12)                         = 0

As pointed out before, there’s no access to /sys/kernel/tegra_gpu. Also, /dev/nvhost-prof-gpu and other devices don’t appear here.

Please let me know if you need anything else.