Nvidia acceleration on chrome and firefox

Ok, the card got delivered. I have the following packages installed:

maciek@galeon:~ pacman -Qs nvidia
local/lib32-libvdpau 1.1.1-2
    Nvidia VDPAU library
local/lib32-nvidia-utils 390.42-1
    NVIDIA drivers utilities (32-bit)
local/libvdpau 1.1.1+3+ga21bf7a-1
    Nvidia VDPAU library
local/libxnvctrl 390.42-1
    NVIDIA NV-CONTROL X extension
local/nvidia 390.42-3
    NVIDIA drivers for linux
local/nvidia-lts 1:390.42-1
    NVIDIA drivers for linux-lts
local/nvidia-settings 390.42-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 390.42-1
    NVIDIA drivers utilities
local/xf86-video-nouveau 1.0.15-2 (xorg-drivers)
    Open Source 2D acceleration driver for nVidia cards

And it’s crashing as reported. Will try to debug it.

FWIW, I made an experimental snap with some tools for debugging graphics issues. It’s called graphics-debug-tools-bboozzoo and is currently released only to the edge channel.

maciek@galeon:~ snap run --gdb graphics-debug-tools-bboozzoo.glxinfo
...
Thread 1 "glxinfo" received signal SIGSEGV, Segmentation fault.
0x00007feae2c46fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007feae2c46fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007feae2477df7 in __glDispatchNewVendorID () from target:/var/lib/snapd/lib/gl/libGLdispatch.so.0
#2  0x00007feae29093c2 in ?? () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#3  0x00007feae290a718 in ?? () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#4  0x00007feae2903ec2 in glXChooseVisual () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#5  0x0000000000401741 in ?? ()
#6  0x00007feae2b52830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#7  0x0000000000401ea9 in ?? ()

glvnd?

The explosions are almost certainly caused by the fact that the GL stack has changed to use glvnd now. We’ve had similar problems in Fedora for a while now, but analyzing glvnd issues hasn’t been a priority until Ubuntu finally got it with Ubuntu 18.04…

Another run of glxinfo with ASLR disabled to get stable addresses. Backtrace: https://paste.ubuntu.com/p/TwRRRVZHYs/ strace: https://paste.ubuntu.com/p/QdsZHM2rnz/

I can only second this. With GLVND, the Nvidia driver binaries on Ubuntu are no longer in /usr/lib/nvidia-* but further down in the tree. I guess something similar happened on Arch. I also guess @zyga-snapd already has a plan for this, as mentioned in the topic “Nvidia GL libs access broken on Ubuntu 18.04”.

Spent some time debugging this and I’m close to being stuck.

What I did (IOW diary of a madman):

(gdb) bt
#0  0x00007ffff7559fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2  0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3  0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4  0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:442
#5  0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6  0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7  0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8  0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9  0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000401ea9 in ?? ()
  • reordered __glDispatchNewVendorID to be called right after memcpy
  • got stack smashing and ABRT higher up the stack, brilliant idea, rebuild with stack protector and see if the problem happens earlier
  • rebuilt libglvnd with CFLAGS='-O0 -ggdb -fstack-protector -fstack-protector-all'
  • dump
  • get stack smashing again
  • gdb again, ended up with this:
B+ │0x7ffff72117bb <__glXLookupVendorByName+19>     mov    %fs:0x28,%rax                                              │
   │0x7ffff72117c4 <__glXLookupVendorByName+28>     mov    %rax,-0x18(%rbp)                                           │
                                                canary set ^^^
  >│0x7ffff72117c8 <__glXLookupVendorByName+32>     xor    %eax,%eax

(gdb) print/x $rax
$2 = 0xe8e2f5458058dc00
(gdb) print/x *(uint64_t*)($rbp -0x18)
$4 = 0xe8e2f5458058dc00
(gdb) print/x *(uint64_t*)(0x7fffffffde28)
$5 = 0xe8e2f5458058dc00
(gdb) print/x $rax
$7 = 0xe8e2f5458058dc00  <--- %fs:0x28
# try to catch return
(gdb) b libglxmapping.c:509
Breakpoint 4 at 0x7ffff72135fc: file libglxmapping.c, line 509.
(gdb) c
Continuing.

Thread 1 "glxinfo" hit Breakpoint 4, __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:509

  >│0x7ffff721365e <__glXLookupVendorByName+7862>   mov    -0x18(%rbp),%rbx                                           │
   │0x7ffff7213662 <__glXLookupVendorByName+7866>   xor    %fs:0x28,%rbx                                              │
   │0x7ffff721366b <__glXLookupVendorByName+7875>   je     0x7ffff7213672 <__glXLookupVendorByName+7882>              │
                                       canary check ^^^
   │0x7ffff721366d <__glXLookupVendorByName+7877>   callq  0x7ffff7208cf0 <__stack_chk_fail@plt>

(gdb) print/x *(uint64_t*)(0x7fffffffde28)  <-- canary unchanged?
$8 = 0xe8e2f5458058dc00
(gdb) stepi
(gdb) print/x $rbx
$9 = 0xe8e2f5458058dc00
(gdb) stepi
(gdb) print/x $rbx
$10 = 0xe8e28aba75a88cc0 <-- xor %fs:0x28,%rbx, $fs:0x28 must have been 0x7ffff5f050c0 now?

From what I understand, %fs is related to TLS on x86-64, and %fs:0x28 is where the stack protector reads its canary from. The problem may then go deeper, into libc and pthread.
Would appreciate any advice on what to try next.
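
For anyone who wants to double-check the last gdb step: XOR is its own inverse, so the value that was sitting at %fs:0x28 during the canary check can be recovered from the two $rbx readings. A small Python sketch, using only the numbers from the session above (the recover_guard name is made up for illustration):

```python
def recover_guard(saved_canary: int, xor_result: int) -> int:
    """Undo `xor %fs:0x28,%rbx`: since rbx = saved ^ guard, guard = saved ^ rbx."""
    return saved_canary ^ xor_result


# Values copied from the gdb session above.
saved_canary = 0xE8E2F5458058DC00  # $rbx after `mov -0x18(%rbp),%rbx`
xor_result = 0xE8E28ABA75A88CC0    # $rbx after `xor %fs:0x28,%rbx`

guard_now = recover_guard(saved_canary, xor_result)
print(hex(guard_now))  # prints 0x7ffff5f050c0
```

The result matches the 0x7ffff5f050c0 guess in the gdb annotation. It looks more like a userspace address than a random canary, which would fit the suspicion that something is off with TLS, though that part is only a guess.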

Solus also uses glvnd and snaps work fine.
@ikey are you using some kind of dark magic or what?

Something I totally didn’t notice:

[Mar21 06:56] traps: glxinfo[29277] general protection ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]
[Mar21 07:05] traps: glxinfo[30174] general protection ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]
[Mar21 07:20] traps: glxinfo[31424] general protection ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]
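
For what it’s worth, the ip and the mapping base in those trap lines are enough to work out where inside libc the fault happened. A minimal Python sketch (the regex and the fault_offset name are mine, and the regex is only fitted to the three lines above; it is not a general parser for kernel logs):

```python
import re

# Fitted to the "traps:" lines quoted above; other kernel messages may differ.
TRAP_RE = re.compile(
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, offset of ip inside the mapping), or None if no match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return m["obj"], ip - base


line = (
    "[Mar21 06:56] traps: glxinfo[29277] general protection "
    "ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]"
)
obj, offset = fault_offset(line)
print(obj, hex(offset))  # prints libc-2.23.so 0x114fb4
```

The ip value is the same address as frame #0 (pthread_mutex_lock) in the backtraces above, so the kernel log and gdb agree on where it dies.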

Small update. Per @niemeyer’s advice to try a simpler approach, I’ve set up a xenial chroot, dumped the libglvnd I built before and copied over nvidia drivers that came from Arch packages. I was able to reproduce the segfault without much trouble. The backtrace:

(gdb) bt
#0  0x00007ffff7559fb4 in pthread_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at forward.c:192
#1  0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2  0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3  0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4  0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x618ad0 "nvidia") at libglxmapping.c:442
#5  0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6  0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7  0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8  0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9  0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main (main=0x401630, argc=1, argv=0x7fffffffe608, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe5f8) at ../csu/libc-start.c:291
#11 0x0000000000401ea9 in ?? ()

The upside is that at least I can install the usual debugging tools now and try to dig deeper.

Turns out nvidia ships a couple of libraries that may fiddle with TLS, or at least that’s what the name libnvidia-tls.so* suggests. There are 2 copies of the library (at least on Arch), one under /usr/lib and another under /usr/lib/tls:

lrwxrwxrwx 1 root root    23 03-14 07:07 /usr/lib/libnvidia-tls.so -> libnvidia-tls.so.390.42
-rwxr-xr-x 1 root root 13080 03-14 07:07 /usr/lib/libnvidia-tls.so.390.42
lrwxrwxrwx 1 root root    23 03-14 07:07 /usr/lib/tls/libnvidia-tls.so -> libnvidia-tls.so.390.42
-rwxr-xr-x 1 root root 14480 03-14 07:07 /usr/lib/tls/libnvidia-tls.so.390.42

The libraries under tls have a different checksum than those one level up. Since the location should not matter for ld.so and we prepend the whole /var/lib/snapd/lib/gl path, I ignored those files. But copying over the files from /usr/lib/tls magically fixed the problem: no more segfaults, glxinfo works, and so does ohmygiraffe.
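
To check whether two same-named libraries really differ, comparing hashes is enough. A rough Python sketch, where the helper names (sha256_of, differing_copies) and the default glob are my own choices for illustration:

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 of a file's contents (symlinks are followed)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def differing_copies(dir_a, dir_b, pattern="libnvidia-tls.so*"):
    """Names matching `pattern` that exist in both directories with different content."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    differing = []
    for file_a in sorted(dir_a.glob(pattern)):
        file_b = dir_b / file_a.name
        # Only compare names that are present as files on both sides.
        if file_a.is_file() and file_b.is_file():
            if sha256_of(file_a) != sha256_of(file_b):
                differing.append(file_a.name)
    return differing
```

On a layout like the listing above, differing_copies("/usr/lib", "/usr/lib/tls") should report both the versioned file and the symlink pointing at it, since symlinks are followed before hashing.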

I’ve opened a PR with updated snap-confine globs:
https://github.com/snapcore/snapd/pull/4901

It’d be great if someone on Ubuntu, Debian, Solus, Fedora or another distro could check that the PR does not break things for them.

If you can tell me how to try it, I can help :)

You can install snapd-git from AUR, it builds the latest master.

Another PR, this time to make sure that we preserve the original layout of the nvidia libs:
https://github.com/snapcore/snapd/pull/4902

Instead of mixed up libraries:

.:
total 0
lrwxrwxrwx 1 root maciek 23 Mar 22 08:41 libnvidia-tls.so -> libnvidia-tls.so.390.42
lrwxrwxrwx 1 root maciek 57 Mar 22 08:41 libnvidia-tls.so.390.42 -> /var/lib/snapd/hostfs/usr/lib/tls/libnvidia-tls.so.390.42

We should get the mirrored structure:

.:
total 0
lrwxrwxrwx 1 root maciek  23 Mar 22 09:43 libnvidia-tls.so -> libnvidia-tls.so.390.42
lrwxrwxrwx 1 root maciek  53 Mar 22 09:43 libnvidia-tls.so.390.42 -> /var/lib/snapd/hostfs/usr/lib/libnvidia-tls.so.390.42
drwxr-xr-x 2 root maciek  80 Mar 22 09:43 tls

./tls:
total 0
lrwxrwxrwx 1 root maciek 23 Mar 22 09:43 libnvidia-tls.so -> libnvidia-tls.so.390.42
lrwxrwxrwx 1 root maciek 57 Mar 22 09:43 libnvidia-tls.so.390.42 -> /var/lib/snapd/hostfs/usr/lib/tls/libnvidia-tls.so.390.42
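
To make the “mirrored structure” idea concrete, here is a small Python illustration. It is not the code from the PR, and mirror_symlinks with its parameters is hypothetical. It just walks a host library directory and recreates matching entries as symlinks at the same relative path, so tls/ stays tls/:

```python
import os
from pathlib import Path


def mirror_symlinks(host_lib, target_dir, pattern="libnvidia-tls.so*"):
    """Symlink every match under host_lib into target_dir, keeping subdirectories."""
    host_lib = Path(host_lib).absolute()
    target_dir = Path(target_dir)
    created = []
    for src in sorted(host_lib.rglob(pattern)):
        rel = src.relative_to(host_lib)       # e.g. tls/libnvidia-tls.so.390.42
        link = target_dir / rel
        link.parent.mkdir(parents=True, exist_ok=True)
        if src.is_symlink():
            # Copy the link text as-is, so relative links stay relative.
            os.symlink(os.readlink(src), link)
        else:
            # Regular files get an absolute link back to the host copy.
            os.symlink(src, link)
        created.append(rel.as_posix())
    return created
```

Because the relative path is kept, the two libnvidia-tls.so.390.42 files can no longer overwrite each other in the target directory, which is what went wrong in the mixed-up listing.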

It’s resolved :) Thanks, I was facing this issue for a while.

Edited: This fixed most of the applications (vlc, ppsspp, games, etc.) :)

On both Arch and Manjaro with Nvidia 380 or 390 series drivers, snaps using hardware acceleration are crashing.

Spotify somewhat works, but it takes a long time to load and hardware acceleration is disabled.
Ohmygiraffe doesn’t work at all.

snap run --gdb doesn’t produce a stack trace and coredumpctl doesn’t show anything really useful.

Note: Solus with the same driver versions works fine.

ping @mborzecki

Possibly related Wine bug about Nvidia drivers: https://bugs.winehq.org/show_bug.cgi?id=43530

@niemeyer can you merge this topic to Nvidia acceleration on chrome and firefox ?

@mborzecki Sure, that’s done.

Hello, I’m using Ubuntu 18.04, snap version 16-2.32+git622.ab40e67 and the Nvidia 390 driver.

I have issues with spotify and some other snaps. The issue looks like this:

snap run spotify
failed to create prefix path: /tmp/snap.rootfs_zTBDGl/var/lib/snapd/lib/vulkan/icd.d: Permission denied

snap run flare-rpg
failed to create prefix path: /tmp/snap.rootfs_smuH38/var/lib/snapd/lib/vulkan/icd.d: Permission denied

But skype and atom work.

If I switch to the Intel card, all snaps work. I’m not sure if I’m reporting my issue in the correct thread; if not, please let me know.