Nvidia GL libs access broken on Ubuntu 18.04

mborzecki
upcoming

#1

Hey,

I have a snap which uses EGL/OpenGL ES etc. I recently installed bionic on one of my systems, which breaks that snap’s access to the host Nvidia GL libraries: bionic now uses GLVND and no longer ships the nvidia driver in a dedicated /usr/lib/nvidia* directory. As a result, snap-confine does not set up /var/lib/snapd/gl inside the snap execution environment anymore, and my snap has to rely on the built-in Mesa GL bits from 16.04, which obviously fail on an Nvidia GPU driven system.

The problem should exist for any snap using a GL driver on an Nvidia system. Certain apps will probably fall back to a software renderer (chromium, …), but that is not something we really want.

Are there already any plans to support GLVND driven systems with snapd/snap-confine?

For now I will switch that system back to exclusively use its Intel GPU.

Thanks.

regards,
Simon


Nvidia acceleration on chrome and firefox
#2

So it seems the days when snap-confine could rely on a build-time choice between glvnd and a monolithic directory are over. As a test you could recompile snap-confine with different build options, disable reexec, and see if that makes nvidia work. The unfortunate thing is that we don’t have many nvidia users on the core team, so testing would be spotty.

The real solution is to compile both features (sans-glvnd and with-glvnd) into snap-confine and make the detection a runtime option. This will also reduce the set of features that make reexec incompatible with some distributions so it is a good thing in general. The only question is how to detect which one to enable on a given system.
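A minimal sketch of what such runtime detection could look like. The paths, the x86_64-linux-gnu triplet, and the driver major version are illustrative assumptions for this thread, not snapd’s actual logic:

```shell
# Hypothetical layout probe: decide between the pre-18.04 monolithic
# /usr/lib/nvidia-<major> directory and the GLVND per-triplet layout.
# "$1" is a root prefix so the sketch can be tried against a fake tree.
detect_nvidia_layout() {
    root="$1"
    major="$2"
    if [ -d "$root/usr/lib/nvidia-$major" ]; then
        echo monolithic
    elif [ -e "$root/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.$major" ]; then
        echo glvnd
    else
        echo none
    fi
}
```

snap-confine itself is written in C, so the real check would use access(2)/stat(2) rather than shell, but the decision logic would be the same shape.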

I think this is something we have to address before 18.04 ships. CC @mvo @JamieBennett


#3

Absolutely. Thanks for taking this up!


#4

As suggested in another thread (see [Nvidia Proprietary Driver] No H/W Acceleration in Chromium and Firefox (stack smashing problem also)), you can easily get access to an Nvidia GPU via AWS or Google Cloud.


#5

We will soon have a new GTX 1030 for debugging this. We never managed to use a cloud-based solution because it was always married to a special kernel.


#6

That is fine, but it seems like a spread test with some mocking and the Nvidia driver installed should help avoid regressions in this area in the future. From what I remember, all snap-confine did to detect Nvidia was check for file locations etc.
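As a rough idea, the mocking half of such a spread test could just fabricate the file locations snap-confine checks. The 390/390.42 versions and the triplet below are made-up examples; a real spread test would run inside the test VM and then exercise a GL-using snap:

```shell
# Fabricate both driver layouts under a throwaway root so detection
# code can be exercised without real Nvidia hardware.
MOCK_ROOT=$(mktemp -d)

# pre-bionic monolithic layout
mkdir -p "$MOCK_ROOT/usr/lib/nvidia-390"
touch "$MOCK_ROOT/usr/lib/nvidia-390/libGL.so.1"

# bionic GLVND layout (multiarch triplet directory)
mkdir -p "$MOCK_ROOT/usr/lib/x86_64-linux-gnu"
touch "$MOCK_ROOT/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.390.42"

ls -R "$MOCK_ROOT/usr/lib"
```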


#7

Well, not necessarily: can’t >=18.04 simply build with different options than <18.04?


#8

I don’t know the details, but wouldn’t snap-confine come from the core snap if it is newer than what the deb ships? If that is true we would end up in a mixed-world…


#9

Yes, you are right. We would need runtime detection to properly support re-exec.


#10

No because there’s just one version, the one in the core snap.


#11

There are two snap-confine binaries that must be considered with re-exec: the one in the snap and the one in the deb. The snap will always have the 16.04 non-glvnd build options. What I was saying is that 18.04 could have different build options for glvnd, but I corrected myself since that wouldn’t work with reexec, as the snap has the 16.04 non-glvnd build options. Even if we only ever used the snap-confine from the core snap, though, it is going to have to handle systems with and without glvnd, so I was agreeing with you and @morphis that the static build options may need to go away… Am I missing something?


#12

No, this is exactly right.

I think we will make snap-confine capable of doing both and just give it hints at runtime as to which layout to use. One of my desires would be to migrate the opengl support to the current layout/content/mount code so that snapd would get to decide.


#13

Since I’m doing all the nvidia fixes atm, I’ve opened a PR with an RFC:


The idea is to autodetect whether the host is using /usr/lib/<arch-triplet> and try to do the right thing.

FWIW ohmygiraffe works now.


#14

I’ve updated the PR with more elaborate detection. Systems with the Nvidia drivers package in xenial (actually all versions before bionic) have both /usr/lib/<arch-triplet> and /usr/lib/nvidia-<version> directories, but the Nvidia libraries are under /usr/lib/nvidia-<version>. Then in bionic, all libraries are under /usr/lib/<arch-triplet>.

The way things work now:

  • during configure time, pass --with-host-arch-triplet=x86_64-linux-gnu (optionally --with-host-arch-32bit-triplet=i386-linux-gnu) - this is picked up automatically in debian/rules and autogen.sh
  • use a well-known nvidia library as a canary, in this case libnvidia-glcore.so.<driver-major>.<driver-minor>, eg. libnvidia-glcore.so.390.42 on my system, where 390.42 is the driver package version
  • at runtime
    • grab the current driver version by poking /sys (we had this code before)
    • check if /usr/lib/<arch-triplet>/libnvidia-glcore.so.%d.%d exists; if so, symlink all the files like we do in the biarch case
      • additionally check if /usr/lib/<arch-32bit-triplet>/libnvidia-glcore... exists; if so, also symlink it at the proper location
    • when symlinking, the prefix path is preserved, eg.
      /usr/lib/<arch-triplet>/libnvidia-tls.so..  -> /var/lib/snapd/lib/gl/<arch-triplet>/libnvidia-tls.so..
      /usr/lib/<arch-triplet>/tls/libnvidia-tls.so..  -> /var/lib/snapd/lib/gl/<arch-triplet>/tls/libnvidia-tls.so..
      
    • if the canary lib does not exist, fall back to bind mounting /usr/lib/nvidia-<driver-major> as we did before
  • the apparmor profile was updated to allow snap-confine to create the prefix path as needed
  • the opengl interface was updated to allow access to /var/lib/snapd/lib/gl/<arch-triplet>/tls/libnvidia-tls.so.., as this was causing unexplained segfaults that were already debugged in the other nvidia thread
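The runtime steps above can be sketched in shell against a fake root. The x86_64-linux-gnu triplet and the 390.42 version are example assumptions borrowed from this post; the real implementation lives in snap-confine’s C code:

```shell
# Build a fake root mimicking a bionic system with nvidia 390.42.
ROOT=$(mktemp -d)
ARCH=x86_64-linux-gnu
mkdir -p "$ROOT/usr/lib/$ARCH/tls" "$ROOT/sys/module/nvidia"
echo "390.42" > "$ROOT/sys/module/nvidia/version"
touch "$ROOT/usr/lib/$ARCH/libnvidia-glcore.so.390.42"
touch "$ROOT/usr/lib/$ARCH/tls/libnvidia-tls.so.390.42"

# Step 1: grab the current driver version by poking /sys.
ver=$(cat "$ROOT/sys/module/nvidia/version")

# Step 2: check whether the canary library exists in the triplet dir.
canary="$ROOT/usr/lib/$ARCH/libnvidia-glcore.so.$ver"
if [ -e "$canary" ]; then
    # Step 3: symlink every nvidia library into the snapd gl dir,
    # preserving the tls/ prefix path.
    mkdir -p "$ROOT/var/lib/snapd/lib/gl/$ARCH/tls"
    for lib in "$ROOT/usr/lib/$ARCH"/libnvidia-* \
               "$ROOT/usr/lib/$ARCH"/tls/libnvidia-*; do
        rel=${lib#"$ROOT/usr/lib/$ARCH/"}
        ln -s "$lib" "$ROOT/var/lib/snapd/lib/gl/$ARCH/$rel"
    done
fi
ls -R "$ROOT/var/lib/snapd/lib/gl/$ARCH"
```

If the canary were missing, the fallback would be the old bind mount of /usr/lib/nvidia-<driver-major> instead.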

Ohmygiraffe works with nvidia and confinement now.


#15

I’ve tested @mborzecki’s PR on my 16.04 laptop, running three games (ohmygiraffe, minecraft, and supertuxkart), and it works fine with both the nvidia card and the intel card, as long as I discard the mount namespaces of opengl-using snaps when I switch PRIME.

In other words, it works as expected; it’s either not a regression or an outright improvement (I hadn’t realised I needed to discard namespaces when switching, so I haven’t tested that without this patch; before, it looked like things just didn’t work with intel).

The glxinfo in graphics-debug-tools-bboozzoo is still not finding the intel driver, though, which might be a bug in it (or in what we’re doing), but it’s still not a regression.


#16

What are you stating isn’t a regression? You tested on an unaffected system, since the problem occurs on 18.04. Therefore your statement that “it works fine” is incorrect with regard to the bug being discussed here.


#17

Hmm, I guess there is no Vulkan support here…
Arch Linux

➜  ~ graphics-debug-tools-bboozzoo.vulkaninfo
===========
VULKAN INFO
===========

Vulkan API Version: 1.0.61

Cannot create Vulkan instance.
/build/vulkan-YO5iw0/vulkan-1.0.61.1+dfsg1/demos/vulkaninfo.c:704: failed with VK_ERROR_INCOMPATIBLE_DRIVER

#18

Doesn’t he mean the patch for 18.04 doesn’t cause a regression in 16.04?


#19

Maybe. English is only my first language, so I’m likely to have misread :-p Words I do good readie writer :smiley:


#20

I’m aware. FWIW, it’s the same on Arch too, but I haven’t tried to debug it further yet.