Nvidia acceleration on chrome and firefox

Syntist · March 17, 2018, 5:09pm

No hardware acceleration in many apps, and a stack smashing problem causes applications to terminate (VLC, PPSSPP, etc.).

System:

➜  ~ screenfetch
                   -`                 
                  .o+`                 syntist@Syntist-PC
                 `ooo/                 OS: Arch Linux 
                `+oooo:                Kernel: x86_64 Linux 4.15.9-1-ARCH
               `+oooooo:               Uptime: 2h 15m
               -+oooooo+:              Packages: 796
             `/:-:++oooo+:             Shell: zsh 5.4.2
            `/++++/+++++++:            Resolution: 1920x1080
           `/++++++++++++++:           DE: GNOME 
          `/+++ooooooooooooo/`         WM: GNOME Shell
         ./ooosssso++osssssso+`        WM Theme: Adapta
        .oossssso-````/ossssss+`       GTK Theme: Adapta [GTK2/3]
       -osssssso.      :ssssssso.      Icon Theme: Papirus-Adapta
      :osssssss/        osssso+++.     Font: Ubuntu 11
     /ossssssss/        +ssssooo/-     CPU: Intel Core i5-3570 @ 4x 3.8GHz [27.8°C]
   `/ossssso+/:-        -:/+osssso+-   GPU: GeForce GT 640
  `+sso+:-`                 `.-/+oso:  RAM: 4424MiB / 11956MiB
 `++:.                           `-/+/
 .`                                 `/

➜ ~ snap version
snap 2.31.2-1
snapd 2.31.2-1
series 16
arch
kernel 4.15.9-1-ARCH

Driver Version:

➜  ~ modinfo nvidia | grep version 
version:        390.42
srcversion:     FA33B00C00A6F70EC9CF314
vermagic:       4.15.9-1-ARCH SMP preempt mod_unload modversions

mborzecki · March 19, 2018, 6:12am

The problem has also been reported here:

So far we do not have any idea what might be causing this. Not having an nvidia card on the team makes it a bit harder to debug.

Can you try ohmygiraffe, run it with snap run --gdb ohmygiraffe and collect the backtrace?

morphis · March 19, 2018, 7:52am

This is the second time I’ve heard of this now. You guys can easily get access to an Nvidia GPU via AWS or Google Cloud. I only tried AWS some time ago and it worked well enough.

Btw, a similar problem appears on the upcoming Ubuntu 18.04. It makes snaps with hw acceleration pretty much unusable, and with that upgrading is a no-go on Nvidia-only systems: Nvidia GL libs access broken on Ubuntu 18.04

mborzecki · March 19, 2018, 7:58am

AWS and GCP use a custom kernel and we didn’t like that for some reason, though I cannot recall the exact reasoning.

Anyways, I’ve ordered a cheap GT1030 GPU. Hopefully I will be able to get into debugging this tomorrow or on Wednesday.

Syntist · March 19, 2018, 12:07pm

Maybe it’s related to the Arch Linux build of snapd?

➜ ~ snap run ohmygiraffe
AL lib: (WW) ReadALConfig: Ignoring XDG config dir:
/snap/ohmygiraffe/3/bin/launch_omg: line 59: 15151 Segmentation fault (core dumped) $SNAP/usr/bin/love $SNAP/oh-my-giraffe.love

➜ ~ snap run --gdb ohmygiraffe
error: unknown flag `gdb'

strace:
https://pastecode.xyz/view/c0179e0c

mborzecki · March 20, 2018, 4:38pm

Ok, the card got delivered. I have the following packages installed:

maciek@galeon:~ pacman -Qs nvidia
local/lib32-libvdpau 1.1.1-2
    Nvidia VDPAU library
local/lib32-nvidia-utils 390.42-1
    NVIDIA drivers utilities (32-bit)
local/libvdpau 1.1.1+3+ga21bf7a-1
    Nvidia VDPAU library
local/libxnvctrl 390.42-1
    NVIDIA NV-CONTROL X extension
local/nvidia 390.42-3
    NVIDIA drivers for linux
local/nvidia-lts 1:390.42-1
    NVIDIA drivers for linux-lts
local/nvidia-settings 390.42-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 390.42-1
    NVIDIA drivers utilities
local/xf86-video-nouveau 1.0.15-2 (xorg-drivers)
    Open Source 2D acceleration driver for nVidia cards

And it’s crashing as reported. Will try to debug it.

mborzecki · March 20, 2018, 4:56pm

FWIW, I made an experimental snap with some tools to debug issues with graphics. It’s called graphics-debug-tools-bboozzoo and is currently released only to the edge channel.

mborzecki · March 20, 2018, 4:59pm

maciek@galeon:~ snap run --gdb graphics-debug-tools-bboozzoo.glxinfo
...
Thread 1 "glxinfo" received signal SIGSEGV, Segmentation fault.
0x00007feae2c46fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007feae2c46fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007feae2477df7 in __glDispatchNewVendorID () from target:/var/lib/snapd/lib/gl/libGLdispatch.so.0
#2  0x00007feae29093c2 in ?? () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#3  0x00007feae290a718 in ?? () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#4  0x00007feae2903ec2 in glXChooseVisual () from target:/var/lib/snapd/lib/gl/libGLX.so.0
#5  0x0000000000401741 in ?? ()
#6  0x00007feae2b52830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#7  0x0000000000401ea9 in ?? ()

glvnd?

Conan_Kudo · March 20, 2018, 5:12pm

The explosions are almost certainly caused by the fact that the GL stack has changed to use glvnd now. We’ve had similar problems in Fedora for a while now, but analyzing glvnd issues hasn’t been a priority until Ubuntu finally got it with Ubuntu 18.04…

mborzecki · March 20, 2018, 5:15pm

Another run of glxinfo with ASLR disabled to get stable addresses. Backtrace: https://paste.ubuntu.com/p/TwRRRVZHYs/ strace: https://paste.ubuntu.com/p/QdsZHM2rnz/

morphis · March 20, 2018, 9:04pm

I can just second this. With GLVND, the Nvidia driver binaries on Ubuntu are no longer in /usr/lib/nvidia-* but further down in the tree. I guess something similar happened on Arch. I guess @zyga-snapd already has a plan for this, as mentioned in Nvidia GL libs access broken on Ubuntu 18.04

mborzecki · March 21, 2018, 12:02pm

Spent some time debugging this and I’m close to being stuck.

What I did (IOW diary of a madman):

- checked Arch’s PKGBUILD of nvidia, no compilation there, just repacking, https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/nvidia-utils
- NVIDIA drivers do glvnd now, https://devtalk.nvidia.com/default/topic/915640/unix-graphics-announcements-and-news/multiple-glx-client-libraries-in-the-nvidia-linux-driver-installer-package/
- Arch package is clearly done with glvnd in mind
- rebuilt libglvnd from master on xenial
  - used ./configure --prefix=/usr --enable-debug=yes CFLAGS='-O0 -ggdb'
  - tarred whatever was under $(DESTDIR)/usr/lib
- on the host: sudo nsenter -m/run/snapd/ns/graphics-debug-tools-bboozzoo.mnt
- while in /var/lib/snapd/lib/gl:
  - tar tf ~maciek/work/canonical/libglvnd.tar | xargs rm -f
  - tar xvf ~maciek/work/canonical/libglvnd.tar
  - checked that libGL* are local files
- disabled ASLR to get predictable addresses
- snap run --gdb graphics-debug-tools-bboozzoo.glxinfo
- got a segfault in pthread_mutex_lock(), pointing here: https://github.com/NVIDIA/libglvnd/blob/master/src/GLX/libglxmapping.c#L442 (__glXLookupVendorByName). The place did not make any sense; backtrace below:

(gdb) bt
#0  0x00007ffff7559fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2  0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3  0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4  0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:442
#5  0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6  0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7  0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8  0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9  0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000401ea9 in ?? ()

- reordered __glDispatchNewVendorID to be called right after memcpy
- got stack smashing and ABRT higher up the stack; brilliant idea: rebuild with stack protector and see if the problem happens earlier
- rebuilt libglvnd with CFLAGS='-O0 -ggdb -fstack-protector -fstack-protector-all'
- dump
- got stack smashing again
- gdb again, ended up with this:

B+ │0x7ffff72117bb <__glXLookupVendorByName+19>     mov    %fs:0x28,%rax                                              │
   │0x7ffff72117c4 <__glXLookupVendorByName+28>     mov    %rax,-0x18(%rbp)                                           │
                                                canary set ^^^
  >│0x7ffff72117c8 <__glXLookupVendorByName+32>     xor    %eax,%eax

(gdb) print/x $rax
$2 = 0xe8e2f5458058dc00
(gdb) print/x *(uint64_t*)($rbp -0x18)
$4 = 0xe8e2f5458058dc00
(gdb) print/x *(uint64_t*)(0x7fffffffde28)
$5 = 0xe8e2f5458058dc00
(gdb) print/x $rax
$7 = 0xe8e2f5458058dc00  <--- %fs:0x28
# try to catch return
(gdb) b libglxmapping.c:509
Breakpoint 4 at 0x7ffff72135fc: file libglxmapping.c, line 509.
(gdb) c
Continuing.

Thread 1 "glxinfo" hit Breakpoint 4, __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:509

  >│0x7ffff721365e <__glXLookupVendorByName+7862>   mov    -0x18(%rbp),%rbx                                           │
   │0x7ffff7213662 <__glXLookupVendorByName+7866>   xor    %fs:0x28,%rbx                                              │
   │0x7ffff721366b <__glXLookupVendorByName+7875>   je     0x7ffff7213672 <__glXLookupVendorByName+7882>              │
                                       canary check ^^^
   │0x7ffff721366d <__glXLookupVendorByName+7877>   callq  0x7ffff7208cf0 <__stack_chk_fail@plt>

(gdb) print/x *(uint64_t*)(0x7fffffffde28)  <-- canary unchanged?
$8 = 0xe8e2f5458058dc00
(gdb) stepi
(gdb) print/x $rbx
$9 = 0xe8e2f5458058dc00
(gdb) stepi
(gdb) print/x $rbx
$10 = 0xe8e28aba75a88cc0 <-- xor %fs:0x28,%rbx, $fs:0x28 must have been 0x7ffff5f050c0 now?
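
The XOR in that last step can be checked by hand. A minimal sketch in Python (not part of the original gdb session; the variable names are made up for illustration) that recovers what %fs:0x28 must have held at the check, using only the values printed above:

```python
# Values copied from the gdb session above.
saved_canary = 0xe8e2f5458058dc00   # *(uint64_t*)($rbp - 0x18), same at entry and before the check
rbx_after_xor = 0xe8e28aba75a88cc0  # $rbx after `xor %fs:0x28,%rbx`

# XOR is its own inverse: rbx = saved ^ guard, hence guard = saved ^ rbx.
guard_at_check = saved_canary ^ rbx_after_xor

print(hex(guard_at_check))  # 0x7ffff5f050c0
```

The slot on the stack still holds the value stored at function entry, so the mismatch comes from the %fs:0x28 read returning something else at the check, and that something looks like a userspace address rather than a random canary.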

From what I understand $fs is related to TLS. The problem may then go deeper into libc and pthread.
Would appreciate any advice on what to try next.

mati865 · March 21, 2018, 12:23pm

Solus also uses glvnd and snaps work fine.
@ikey are you using some kind of dark magic or what?

mborzecki · March 21, 2018, 2:01pm

Something I totally didn’t notice:

[Mar21 06:56] traps: glxinfo[29277] general protection ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]
[Mar21 07:05] traps: glxinfo[30174] general protection ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]
[Mar21 07:20] traps: glxinfo[31424] general protection ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]
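
The ip in those lines is the same address as frame #0 of the backtrace (pthread_mutex_lock). A small Python sketch that pulls the fields out of such a line and computes the offset into the libc mapping. The regular expression is an assumption based only on the three samples above, and parse_trap is a made-up helper name, not an existing tool:

```python
import re

# Pattern derived from the three "traps:" samples above; treat it as an
# assumption, not as a general kernel log parser.
TRAP_RE = re.compile(
    r"traps: (?P<comm>[^\[\s]+)\[(?P<pid>\d+)\] (?P<reason>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap(line):
    """Return the faulting process, library and offset, or None if no match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "reason": m.group("reason"),
        "lib": m.group("lib"),
        "offset": ip - base,  # offset of the faulting instruction in the mapping
    }


line = ("[Mar21 06:56] traps: glxinfo[29277] general protection "
        "ip:7ffff7559fb4 sp:7fffffffde78 error:0 in libc-2.23.so[7ffff7445000+1c0000]")
info = parse_trap(line)
print(info["lib"], hex(info["offset"]))  # libc-2.23.so 0x114fb4
```

All three runs fault at the same ip, which is expected with ASLR disabled.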

mborzecki · March 21, 2018, 5:34pm

Small update. Per @niemeyer’s advice to try a simpler approach, I’ve set up a xenial chroot, dumped the libglvnd I built before and copied over nvidia drivers that came from Arch packages. I was able to reproduce the segfault without much trouble. The backtrace:

(gdb) bt
#0  0x00007ffff7559fb4 in pthread_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at forward.c:192
#1  0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2  0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3  0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4  0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x618ad0 "nvidia") at libglxmapping.c:442
#5  0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6  0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7  0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8  0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9  0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main (main=0x401630, argc=1, argv=0x7fffffffe608, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe5f8) at ../csu/libc-start.c:291
#11 0x0000000000401ea9 in ?? ()

The upside is that at least I can install the usual debugging tools now and try to dig deeper.

mborzecki · March 22, 2018, 6:49am

Turns out nvidia ships a couple of libraries that may fiddle with TLS, or at least that’s what the name libnvidia-tls.so* suggests. There are 2 copies of the library (at least on Arch), one under /usr/lib and another under /usr/lib/tls:

lrwxrwxrwx 1 root root    23 03-14 07:07 /usr/lib/libnvidia-tls.so -> libnvidia-tls.so.390.42
-rwxr-xr-x 1 root root 13080 03-14 07:07 /usr/lib/libnvidia-tls.so.390.42
lrwxrwxrwx 1 root root    23 03-14 07:07 /usr/lib/tls/libnvidia-tls.so -> libnvidia-tls.so.390.42
-rwxr-xr-x 1 root root 14480 03-14 07:07 /usr/lib/tls/libnvidia-tls.so.390.42

The libraries under tls have a different checksum than those one level up. Since the location should not matter for ld.so and we prepend the whole /var/lib/snapd/lib/gl path, I ignored those files. But copying over /usr/lib/tls magically fixed the problem: no more segfaults, glxinfo works, and so does ohmygiraffe.
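
For anyone who wants to repeat the checksum comparison on their own system, a minimal Python sketch. The helper names are made up, and the two paths are simply the ones from the listing above, so they only exist with the 390.42 driver installed:

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


if __name__ == "__main__":
    # Paths from the listing above; skipped silently when they are absent.
    plain = "/usr/lib/libnvidia-tls.so.390.42"
    tls = "/usr/lib/tls/libnvidia-tls.so.390.42"
    if os.path.exists(plain) and os.path.exists(tls):
        print("identical" if same_content(plain, tls) else "different")
```

The differing file sizes in the listing (13080 vs 14480 bytes) already show that the two copies are not the same file; the checksum just confirms it.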

I’ve opened a PR with updated snap-confine globs:
https://github.com/snapcore/snapd/pull/4901

It’d be great if someone on Ubuntu, Debian, Solus, Fedora or another distro could check that the PR does not break things for them.

Syntist · March 22, 2018, 8:26am

If you can help me with how to try it, I can be helpful then.

mborzecki · March 22, 2018, 8:46am

You can install snapd-git from AUR, it builds the latest master.

mborzecki · March 22, 2018, 8:51am

Another PR, this time to make sure that we preserve the original layout of nvidia libs:
https://github.com/snapcore/snapd/pull/4902

Instead of mixed up libraries:

.:
total 0
lrwxrwxrwx 1 root maciek 23 Mar 22 08:41 libnvidia-tls.so -> libnvidia-tls.so.390.42
lrwxrwxrwx 1 root maciek 57 Mar 22 08:41 libnvidia-tls.so.390.42 -> /var/lib/snapd/hostfs/usr/lib/tls/libnvidia-tls.so.390.42

We should get the mirrored structure:

.:
total 0
lrwxrwxrwx 1 root maciek  23 Mar 22 09:43 libnvidia-tls.so -> libnvidia-tls.so.390.42
lrwxrwxrwx 1 root maciek  53 Mar 22 09:43 libnvidia-tls.so.390.42 -> /var/lib/snapd/hostfs/usr/lib/libnvidia-tls.so.390.42
drwxr-xr-x 2 root maciek  80 Mar 22 09:43 tls

./tls:
total 0
lrwxrwxrwx 1 root maciek 23 Mar 22 09:43 libnvidia-tls.so -> libnvidia-tls.so.390.42
lrwxrwxrwx 1 root maciek 57 Mar 22 09:43 libnvidia-tls.so.390.42 -> /var/lib/snapd/hostfs/usr/lib/tls/libnvidia-tls.so.390.42
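
To illustrate the mirrored layout: a rough Python sketch that links matching libraries into a destination directory while keeping their subdirectory. snap-confine itself is written in C and this is not its code; the function name, the glob patterns and the directory arguments are all assumptions made for the example.

```python
import glob
import os


def mirror_libs(src_dir, dest_dir, patterns):
    """Symlink files under src_dir matching patterns into dest_dir,
    keeping each path relative to src_dir (so tls/ stays tls/)."""
    created = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
            rel = os.path.relpath(path, src_dir)
            link = os.path.join(dest_dir, rel)
            os.makedirs(os.path.dirname(link), exist_ok=True)
            if os.path.lexists(link):
                continue  # already created by an earlier pattern
            # Replicate symlinks as they are; point regular files at the source copy.
            target = os.readlink(path) if os.path.islink(path) else path
            os.symlink(target, link)
            created.append(rel)
    return created


# Hypothetical call, with directories as in the listing above:
# mirror_libs("/var/lib/snapd/hostfs/usr/lib", "/var/lib/snapd/lib/gl",
#             ["libnvidia*.so*", "tls/libnvidia*.so*"])
```

Linking by file name alone would put both copies of libnvidia-tls.so.390.42 under the same name in one directory, which is how the mixed up layout shown first comes about. Keeping the relative path avoids the clash.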

Syntist · March 22, 2018, 9:13am

It’s resolved, thanks. I was facing this issue for a while.

Edited: This fixed most of the applications (vlc, ppsspp, games, etc.)