This is the second time I've heard this now. You can easily get access to an Nvidia GPU via AWS or Google Cloud. I only tried AWS some time ago, and it worked well enough.
Btw, a similar problem appears on the upcoming Ubuntu 18.04. It makes snaps (with hw acceleration) pretty much unusable, which in turn makes upgrading a no-go on Nvidia-only systems: Nvidia GL libs access broken on Ubuntu 18.04
FWIW, I made an experimental snap with some tools to debug graphics issues. It’s called graphics-debug-tools-bboozzoo and is currently released only to the edge channel.
maciek@galeon:~ snap run --gdb graphics-debug-tools-bboozzoo.glxinfo
...
Thread 1 "glxinfo" received signal SIGSEGV, Segmentation fault.
0x00007feae2c46fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007feae2c46fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007feae2477df7 in __glDispatchNewVendorID () from target:/var/lib/snapd/lib/gl/libGLdispatch.so.0
#2 0x00007feae29093c2 in ?? () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#3 0x00007feae290a718 in ?? () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#4 0x00007feae2903ec2 in glXChooseVisual () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#5 0x0000000000401741 in ?? ()
#6 0x00007feae2b52830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#7 0x0000000000401ea9 in ?? ()
The explosions are almost certainly caused by the fact that the GL stack has changed to use glvnd now. We’ve had similar problems in Fedora for a while now, but analyzing glvnd issues hasn’t been a priority until Ubuntu finally got it with Ubuntu 18.04…
I can only second this. With GLVND, the Nvidia driver binaries on Ubuntu are no longer in /usr/lib/nvidia-* but further down in the tree. I guess something similar happened on Arch. I guess @zyga-snapd already has a plan for this, as mentioned in Nvidia GL libs access broken on Ubuntu 18.04
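For anyone who wants to check where the driver libraries ended up on their own system, here is a minimal Python sketch that walks a directory tree and lists matching files. The helper name find_libs is made up for illustration, and the default pattern libGLX_nvidia.so* is my assumption about the GLVND vendor library name; neither comes from snapd, so adjust both as needed.

```python
import fnmatch
import os


def find_libs(root, pattern="libGLX_nvidia.so*"):
    """Return sorted paths under root whose file name matches pattern.

    The default pattern is an assumption; pass another one if your
    driver package names its libraries differently.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # /usr/lib is only a starting point; other distros may use other roots.
    for path in find_libs("/usr/lib"):
        print(path)
```

Running it with a different root or pattern, e.g. find_libs("/usr/lib", "libnvidia-*"), should show how deep in the tree the files sit on a given distribution.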
(gdb) bt
#0 0x00007ffff7559fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2 0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3 0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4 0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:442
#5 0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6 0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7 0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8 0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9 0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000401ea9 in ?? ()
reordered __glDispatchNewVendorID to be called right after memcpy
got stack smashing and an ABRT higher up the stack; brilliant idea: rebuild with stack protector and see if the problem happens earlier
rebuilt libglvnd with CFLAGS='-O0 -ggdb -fstack-protector -fstack-protector-all'
Small update. Per @niemeyer’s advice to try a simpler approach, I’ve set up a xenial chroot, dumped in the libglvnd I built before, and copied over the Nvidia drivers that came from Arch packages. I was able to reproduce the segfault without much trouble. The backtrace:
(gdb) bt
#0 0x00007ffff7559fb4 in pthread_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at forward.c:192
#1 0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2 0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3 0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4 0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x618ad0 "nvidia") at libglxmapping.c:442
#5 0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6 0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7 0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8 0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9 0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main (main=0x401630, argc=1, argv=0x7fffffffe608, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe5f8) at ../csu/libc-start.c:291
#11 0x0000000000401ea9 in ?? ()
The upside is that at least I can install the usual debugging tools now and try to dig deeper.
Turns out Nvidia ships a couple of libraries that may fiddle with TLS, or at least that’s what the name libnvidia-tls.so* suggests. There are 2 copies of the libraries (at least on Arch), one under /usr/lib and another under /usr/lib/tls:
The libraries under tls have different checksums than those one level up. Since the location should not matter for ld.so and we prepend the whole /var/lib/snapd/lib/gl path, I ignored those files. But copying over /usr/lib/tls magically fixed the problem: no more segfaults, glxinfo works, and so does ohmygiraffe.
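The checksum comparison is easy to repeat. Below is a minimal Python sketch, assuming the two directories and the libnvidia-tls.so* name pattern mentioned above. The helper names sha256_of and compare_copies are hypothetical, and SHA-256 is just my choice of hash; it is not necessarily the tool used for the original check.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(base_dir, sub_dir, pattern="libnvidia-tls.so*"):
    """Compare same-named files in base_dir and sub_dir.

    Returns {file name: True if both copies are byte-identical}.
    Only names present in both directories are reported.
    """
    base_dir, sub_dir = Path(base_dir), Path(sub_dir)
    result = {}
    for candidate in sorted(base_dir.glob(pattern)):
        twin = sub_dir / candidate.name
        if candidate.is_file() and twin.is_file():
            result[candidate.name] = sha256_of(candidate) == sha256_of(twin)
    return result


if __name__ == "__main__":
    # Paths taken from the post above; adjust for other distributions.
    for name, same in compare_copies("/usr/lib", "/usr/lib/tls").items():
        print(f"{name}: {'identical' if same else 'DIFFERENT'}")
```

Any name reported as DIFFERENT has a distinct copy under the tls subdirectory, which is a hint that both directories need to be carried over rather than just the top-level one.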