CUDA for GPGPU on NVIDIA Jetson

theseankelly · June 16, 2020, 8:33pm

Hey Forum,

I’m having some issues deploying GPU-accelerated programs for GPGPU and/or computer vision applications via snaps. There are a few related topics on this forum, but I think my questions are different enough to merit a new thread instead of hijacking an existing one.

I have the following repository: https://github.com/theseankelly/jetson-cuda-snap, which started as a fork of @abeato’s work to enable X11-based programs here: https://github.com/alfonsosanchezbeato/jetson-nano-x11. Starting from his repository, I have made the following changes:

- Dropped X11 support, as I don’t need to render anything
- Upgraded the L4T packages from 32.2 to 32.3.1 (JetPack 4.3) using the TX2 Debian packages, not the Nano ones
- Added CUDA 10.0 from JetPack 4.3, targeting the TX2
- Added a few sample CUDA programs:
  - query.cu – queries for information about the underlying GPU device
  - add.cu – simple application that adds two vectors together, pulled from the tutorial listed at the top of the source file
  - saxpy.cu – similar application, except it uses explicit memory allocations and copies instead of managed memory

I’ve managed to compile the CUDA applications as part of my snapcraft recipe. I’m installing the snap on my TX2 (I have tried both devmode and strict confinement) running Linux for Tegra, where snap --version lists:

uskellse@uskellse-tx2:~$ snap --version
snap    2.45.1
snapd   2.45.1
series  16
ubuntu  18.04
kernel  4.9.140-tegra

When running my applications, I get mixed results. It looks like the query-gpu program can actually communicate with the underlying driver/GPU:

uskellse@uskellse-tx2:~$ jetson-cuda-snap.query-gpu
cudaGetDeviceCount returned: 0
Device Number: 0
  Device name: NVIDIA Tegra X2
  Memory Clock Rate (KHz): 1300000
  Memory Bus Width (bits): 128
  Peak Memory Bandwidth (GB/s): 41.600000
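
For reference, the peak bandwidth figure in this output can be reproduced from the two fields above it. The following is a minimal sketch that assumes the query program uses the common formula of doubling the memory clock for double-data-rate memory; the function name peak_bandwidth_gbps is made up for illustration and is not taken from query.cu.

```cpp
#include <cassert>

// Hypothetical helper (not from query.cu): peak memory bandwidth in GB/s.
// Assumes double-data-rate memory, hence the factor of 2.
double peak_bandwidth_gbps(int mem_clock_khz, int bus_width_bits) {
    assert(mem_clock_khz > 0 && bus_width_bits > 0);
    const double bytes_per_transfer = bus_width_bits / 8.0;
    // KHz * bytes gives kilobytes per second; dividing by 1e6 yields GB/s.
    return 2.0 * mem_clock_khz * bytes_per_transfer / 1.0e6;
}
```

With the values reported above (1300000 KHz and 128 bits), this works out to 2 × 1300000 × 16 / 10^6 = 41.6 GB/s, which matches the printed figure.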

Meanwhile, the add-gpu and saxpy programs do run (and the CUDA APIs do not return errors), but the results buffer does not contain the expected sums. This suggests that either the kernels aren’t actually executing on the GPU or the data isn’t making it across the host/device memory boundary. I’m not sure which.
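
One way to narrow down which of these is happening is to keep a copy of the output vector from before the kernel launch and classify the result on the host afterwards. The sketch below is plain host-side C++ and assumes the usual in-place SAXPY definition, y = a*x + y; the Outcome enum and classify_saxpy function are hypothetical names for illustration and are not part of the repository.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result categories for a host-side check.
enum class Outcome { Correct, Untouched, Wrong };

// Compare the buffer read back from the device against the expected
// SAXPY result (a * x + y_before) and against its pre-launch contents.
Outcome classify_saxpy(float a,
                       const std::vector<float>& x,
                       const std::vector<float>& y_before,
                       const std::vector<float>& y_after,
                       float tol = 1e-5f) {
    assert(x.size() == y_before.size() && x.size() == y_after.size());
    bool all_correct = true;
    bool all_untouched = true;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float expected = a * x[i] + y_before[i];
        if (std::fabs(y_after[i] - expected) > tol) all_correct = false;
        if (y_after[i] != y_before[i]) all_untouched = false;
    }
    // Correct is checked first so that a trivial case such as a == 0
    // is not misreported as Untouched.
    if (all_correct) return Outcome::Correct;
    if (all_untouched) return Outcome::Untouched;
    return Outcome::Wrong;
}
```

An Untouched result would point toward the kernel never running or the device-to-host copy not taking effect, while Wrong would suggest the data is being modified but incorrectly. Since a kernel launch does not return a status itself, it may also be worth checking cudaGetLastError() right after the launch and the return value of cudaDeviceSynchronize().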

Both add-gpu and saxpy behave as expected if I build/execute them natively from Linux for Tegra.

What’s the current state of CUDA on Tegra platforms within snaps? I’m far from a CUDA expert, so it’s very possible I’m doing something fundamentally wrong here.

NOTE: I also tried a similar experiment with my Jetson Nano, which is running Ubuntu Core 18 (per @abeato’s tutorials again, and of course using the binaries from the Nano version of L4T 32.3.1 instead of the TX2 ones). I didn’t make it quite as far: the GPU query fails, and the add-gpu and saxpy programs segfault. I didn’t take it further, since the TX2 running L4T is more interesting to me anyhow.

abeato · June 17, 2020, 10:13am

I would check for apparmor denials in the kernel log. It is quite possible that some permissions are missing. If that is the case, you might need to add plugs to the snap, or propose changes to snapd interfaces as appropriate.

theseankelly · June 17, 2020, 3:14pm

Output from snapd in /var/log/syslog indicates that apparmor is not enabled in Linux for Tegra. Also, I’ve installed with --devmode, which I understand bypasses apparmor, no?

Perhaps this means I have simply failed to set up the NVIDIA userspace stack properly in my snap?

theseankelly · June 22, 2020, 7:12pm

Just confirming that there aren’t any apparmor denials in my kernel log. I used the snappy-debug tool too, and nothing gets flagged. I suppose that’s expected given --devmode.