Hello World CUDA Analysis

This thread contains the analysis of a hello-world CUDA application, done for the purpose of supporting CUDA in the GPU Support Proposal.

Please check the end of this post for conclusions.


The application source code (compile with nvcc hello.cu):

#include <iostream>
#include <math.h>

// CUDA kernel to add the elements of two arrays
__global__ void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements
  float *x, *y;

  // Allocate unified memory, accessible from both host and device
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
Shared Library Usage

Raw data collected with strace and ltrace, hand-edited for clarity and rendered with dot. Note that the dashed line indicates dlopen, while solid lines indicate an ELF NEEDED entry in the dynamic section, that is, regular dynamic linking.

As can be seen, CUDA is effectively encapsulated by the dlopen-ed libcuda.so.1, which only depends on libnvidia-fatbinaryloader.so.418.56. Apart from device access this is, apparently, all that is required for headless CUDA compute work. The location of both files can be seen in the next section.

Runtime File System Access

Raw data collected with strace, hand-edited for clarity and rendered with dot. Note that folders are represented with a stylised folder outline, devices are represented as cylinders, and all other files are represented as undecorated names. Lines represent relationships of files in the filesystem hierarchy. Dashed lines indicate use of UNIX sockets.

The main access patterns revolve around:

  • read only access to libcuda.so.1 and libnvidia-fatbinaryloader.so.418.56
  • read only access to nvidia kernel module information in /proc and /sys
  • read/write access to /dev/nvidiactl, /dev/nvidia0 (likely /dev/nvidia1, etc, when more GPUs are present), /dev/nvidia-uvm

There is also some IPC, though in my case it was not necessary:

  • access to /dev/shm/cuda_injection_path_shm (no such file, purpose unclear)
  • UNIX socket connection to /tmp/nvidia-mps/control (see below)
  • Abstract socket listening on @cuda-uvmfd-4026531836-28299 (purpose unclear)

MPS refers to the Multi-Process Service, where hardware is shared by multiple concurrent processes, for example in MPI applications. This is further documented at https://docs.nvidia.com/deploy/mps/index.html

Runtime Environment Variable Access

Raw data collected with ltrace and rendered with twopi.

Note that getenv really is called with non-ASCII keys; this can be double-checked with gdb:

Thread 1 "a.out" hit Breakpoint 2, __GI_getenv (name=0x7ffff74c8730 "\377\211(xJ\264\375\377") at getenv.c:39

Conclusions

  1. The snapd-nvidia-418 snap should ship libcuda.so.1 and all of its dependencies not provided by the base snap.
  2. Sandbox access granted by the OpenGL interface seems sufficient, but it could be broken out into a dedicated interface if that is beneficial for headless server applications.
  3. Interactions with IPC services need further analysis; they are likely required for advanced setups.
  4. Connection to the UNIX socket in /tmp/nvidia-mps would need special support in snap-confine.