Spent some time debugging this and I’m close to being stuck.
What I did (IOW diary of a madman):
- checked Arch’s PKGBUILD of nvidia, no compilation there, just repacking, https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/nvidia-utils
- NVIDIA drivers use glvnd now, https://devtalk.nvidia.com/default/topic/915640/unix-graphics-announcements-and-news/multiple-glx-client-libraries-in-the-nvidia-linux-driver-installer-package/
- Arch package is clearly done with glvnd in mind
- rebuilt libglvnd from master on xenial
- used `./configure --prefix=/usr --enable-debug=yes CFLAGS='-O0 -ggdb'`
- tarred whatever was under `$(DESTDIR)/usr/lib`
- on the host: `sudo nsenter -m/run/snapd/ns/graphics-debug-tools-bboozzoo.mnt`
- while in `/var/lib/snapd/lib/gl`: `tar tf ~maciek/work/canonical/libglvnd.tar | xargs rm -f`, then `tar xvf ~maciek/work/canonical/libglvnd.tar`
- checked that libGL* are local files
- disabled ASLR to get predictable addresses
- `snap run --gdb graphics-debug-tools-bboozzoo.glxinfo`
- got segfault in `pthread_mutex_lock()`, pointing to here: https://github.com/NVIDIA/libglvnd/blob/master/src/GLX/libglxmapping.c#L442 (`__glXLookupVendorByName`); the place did not make any sense, backtrace below:
(gdb) bt
#0 0x00007ffff7559fb4 in pthread_mutex_lock () from target:/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6f5eddb in mt_mutex_lock (mutex=0x7ffff71e5180 <dispatchLock>) at glvnd_pthread.c:317
#2 0x00007ffff6f23f77 in LockDispatch () at GLdispatch.c:144
#3 0x00007ffff6f24115 in __glDispatchNewVendorID () at GLdispatch.c:198
#4 0x00007ffff7212607 in __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:442
#5 0x00007ffff7213811 in __glXLookupVendorByScreen (dpy=0x60aab0, screen=0) at libglxmapping.c:574
#6 0x00007ffff7213966 in __glXGetDynDispatch (dpy=0x60aab0, screen=0) at libglxmapping.c:608
#7 0x00007ffff7209563 in glXChooseVisual (dpy=0x60aab0, screen=0, attrib_list=0x609200) at libglx.c:215
#8 0x00007ffff7b89d58 in glXChooseVisual (dpy=0x60aab0, screen=0, attribList=0x609200) at g_libglglxwrapper.c:183
#9 0x0000000000401741 in ?? ()
#10 0x00007ffff7465830 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000401ea9 in ?? ()
- reordered `__glDispatchNewVendorID` to be called right after `memcpy`
- got stack smashing and ABRT higher up the stack; brilliant idea: rebuild with stack protector and see if the problem happens earlier
- rebuilt libglvnd with `CFLAGS='-O0 -ggdb -fstack-protector -fstack-protector-all'`
- dump
- get stack smashing again
- gdb again, ended up with this:
B+ │0x7ffff72117bb <__glXLookupVendorByName+19> mov %fs:0x28,%rax │
│0x7ffff72117c4 <__glXLookupVendorByName+28> mov %rax,-0x18(%rbp) │
canary set ^^^
>│0x7ffff72117c8 <__glXLookupVendorByName+32> xor %eax,%eax
(gdb) print/x $rax
$2 = 0xe8e2f5458058dc00
(gdb) print/x *(uint64_t*)($rbp -0x18)
$4 = 0xe8e2f5458058dc00
(gdb) print/x *(uint64_t*)(0x7fffffffde28)
$5 = 0xe8e2f5458058dc00
(gdb) print/x $rax
$7 = 0xe8e2f5458058dc00 <--- %fs:0x28
# try to catch return
(gdb) b libglxmapping.c:509
Breakpoint 4 at 0x7ffff72135fc: file libglxmapping.c, line 509.
(gdb) c
Continuing.
Thread 1 "glxinfo" hit Breakpoint 4, __glXLookupVendorByName (vendorName=0x60d160 "nvidia") at libglxmapping.c:509
>│0x7ffff721365e <__glXLookupVendorByName+7862> mov -0x18(%rbp),%rbx │
│0x7ffff7213662 <__glXLookupVendorByName+7866> xor %fs:0x28,%rbx │
│0x7ffff721366b <__glXLookupVendorByName+7875> je 0x7ffff7213672 <__glXLookupVendorByName+7882> │
canary check ^^^
│0x7ffff721366d <__glXLookupVendorByName+7877> callq 0x7ffff7208cf0 <__stack_chk_fail@plt>
(gdb) print/x *(uint64_t*)(0x7fffffffde28) <-- canary unchanged?
$8 = 0xe8e2f5458058dc00
(gdb) stepi
(gdb) print/x $rbx
$9 = 0xe8e2f5458058dc00
(gdb) stepi
(gdb) print/x $rbx
$10 = 0xe8e28aba75a88cc0 <-- xor %fs:0x28,%rbx, $fs:0x28 must have been 0x7ffff5f050c0 now?
From what I understand `$fs` is related to TLS. The problem may then go deeper into libc and pthread.
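To double-check the arithmetic behind the `%fs:0x28` guess, here is a minimal sketch in Python. It only replays the numbers from the gdb session above. The variable names and the `epilogue_passes` helper are made up for illustration and are not part of libglvnd or glibc.

```python
# Values copied from the gdb session above.
canary_on_stack = 0xE8E2F5458058DC00  # *(uint64_t*)($rbp - 0x18), same at entry and before the check
rbx_after_xor = 0xE8E28ABA75A88CC0    # $rbx right after `xor %fs:0x28,%rbx`

# xor is its own inverse: rbx = canary ^ guard, so guard = canary ^ rbx.
implied_guard = canary_on_stack ^ rbx_after_xor
print(hex(implied_guard))  # 0x7ffff5f050c0


def epilogue_passes(saved: int, guard: int) -> bool:
    """Hypothetical model of the epilogue check: `je` is taken only if saved ^ guard == 0."""
    return (saved ^ guard) == 0


# With the guard value seen in the prologue the check would pass,
# with the implied value at the epilogue it fails, although the stack slot is intact.
print(epilogue_passes(canary_on_stack, canary_on_stack))
print(epilogue_passes(canary_on_stack, implied_guard))
```

If the numbers are right, the stack slot itself was never overwritten. What changed between prologue and epilogue is the value read through `%fs:0x28`, and the implied value looks more like an address in the mapped library range than like a canary, which seems consistent with the TLS suspicion.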
Would appreciate any advice on what to try next.