Nvidia GL libs access broken on Ubuntu 18.04

mborzecki
upcoming

#1

Hey,

I have a snap which uses EGL/OpenGL ES etc. I recently installed bionic on one of my systems, which breaks that snap’s access to the host Nvidia GL libraries: bionic now uses GLVND and no longer ships the nvidia driver in a dedicated /usr/lib/nvidia* directory. As a result, snap-confine does not set up /var/lib/snapd/gl inside the snap execution environment anymore, and my snap has to rely on the built-in Mesa GL bits from 16.04, which obviously fail on an Nvidia GPU driven system.

The problem should exist for any snap using a GL driver on an Nvidia system. Certain apps will probably fall back to a software renderer (chromium, …), but that is not something we really want.

Are there already any plans to support GLVND driven systems with snapd/snap-confine?

For now I will switch that system back to exclusively use its Intel GPU.

Thanks.

regards,
Simon


Nvidia acceleration on chrome and firefox
#2

So it seems the days when snap-confine could rely on a build-time choice between glvnd and a monolithic directory are over. As a test you could recompile snap-confine with different build options, disable reexec, and see if that makes nvidia work. The unfortunate thing is that we don’t have many nvidia users on the core team, so testing would be spotty.

The real solution is to compile both features (sans-glvnd and with-glvnd) into snap-confine and make the detection a runtime option. This will also reduce the set of features that make reexec incompatible with some distributions so it is a good thing in general. The only question is how to detect which one to enable on a given system.
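A minimal sketch of what such runtime detection could look like. The paths, the x86_64-linux-gnu triplet, and the driver major version are illustrative assumptions for this thread, not snapd’s actual logic:

```shell
# Hypothetical layout probe: decide between the pre-18.04 monolithic
# /usr/lib/nvidia-<major> directory and the GLVND per-triplet layout.
# "$1" is a root prefix so the sketch can be tried against a fake tree.
detect_nvidia_layout() {
    root="$1"
    major="$2"
    if [ -d "$root/usr/lib/nvidia-$major" ]; then
        echo monolithic
    elif [ -e "$root/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.$major" ]; then
        echo glvnd
    else
        echo none
    fi
}
```

snap-confine itself is written in C, so the real check would use access(2)/stat(2) rather than shell, but the decision logic would be the same shape.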

I think this is something we have to address before 18.04 ships. CC @mvo @JamieBennett


#3

Absolutely. Thanks for taking this up!


#4

As suggested in another thread (see [Nvidia Proprietary Driver] No H/W Acceleration in Chromium and Firefox (stack smashing problem also)), you can easily get access to an Nvidia GPU via AWS or Google Cloud.


#5

We will soon have a new GTX 1030 for debugging this. We never managed to use a cloud-based solution because it was always married to a special kernel.


#6

That is fine, but it seems like a spread test with some mocking and the Nvidia driver installed should help avoid regressions in this area in the future. From what I remember, all snap-confine did to detect Nvidia was check for file locations etc.
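As a rough idea, the mocking half of such a spread test could just fabricate the file locations snap-confine checks. The 390/390.42 versions and the triplet below are made-up examples; a real spread test would run inside the test VM and then exercise a GL-using snap:

```shell
# Fabricate both driver layouts under a throwaway root so detection
# code can be exercised without real Nvidia hardware.
MOCK_ROOT=$(mktemp -d)

# pre-bionic monolithic layout
mkdir -p "$MOCK_ROOT/usr/lib/nvidia-390"
touch "$MOCK_ROOT/usr/lib/nvidia-390/libGL.so.1"

# bionic GLVND layout (multiarch triplet directory)
mkdir -p "$MOCK_ROOT/usr/lib/x86_64-linux-gnu"
touch "$MOCK_ROOT/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.390.42"

ls -R "$MOCK_ROOT/usr/lib"
```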


#7

Well, not necessarily: can’t >=18.04 simply build with different options than <18.04?


#8

I don’t know the details, but wouldn’t snap-confine come from the core snap if it is newer than what the deb ships? If that is true we would end up in a mixed-world…


#9

Yes, you are right. We would need runtime detection to properly support re-exec.


#10

No because there’s just one version, the one in the core snap.


#11

There are two snap-confine binaries that must be considered with re-exec: the one in the snap and the one in the deb. The snap will always have the 16.04 non-glvnd build options. What I was saying is that 18.04 could have different build options for glvnd, but I corrected myself since that wouldn’t work with reexec, as the snap has the 16.04 non-glvnd build options. Even if we only ever used the snap-confine from the core snap, though, it is going to have to handle systems with and without glvnd, so I was agreeing with you and @morphis that the static build options may need to go away… Am I missing something?


#12

No, this is exactly right.

I think we will make snap-confine capable of doing both and just give it hints at runtime as to which layout to use. One of my desires would be to migrate the opengl support to the current layout/content/mount code so that snapd would get to decide.


#13

Since I’m doing all the nvidia fixes atm, I’ve opened a PR with an RFC:


The idea is to autodetect whether the host is using /usr/lib/<arch-triplet> and try to do the right thing.

FWIW ohmygiraffe works now.


#14

I’ve updated the PR with more elaborate detection. Systems with the Nvidia drivers package in xenial (actually all versions before bionic) have both /usr/lib/<arch-triplet> and /usr/lib/nvidia-<version> directories, but the Nvidia libraries are under /usr/lib/nvidia-<version>. Then in bionic, all libraries are under /usr/lib/<arch-triplet>.

The way things work now:

  • during configure time, pass --with-host-arch-triplet=x86_64-linux-gnu (optionally --with-host-arch-32bit-triplet=i386-linux-gnu) - this is picked up automatically in debian/rules and autogen.sh
  • use a well-known nvidia library as a canary, in this case libnvidia-glcore.so.<driver-major>.<driver-minor>, eg. libnvidia-glcore.so.390.42 on my system, where 390.42 is the driver package version
  • at runtime
    • grab the current driver version by poking /sys (we had this code before)
    • check if /usr/lib/<arch-triplet>/libnvidia-glcore.so.%d.%d exists; if so, symlink all the files like we do in the biarch case
      • additionally check if /usr/lib/<arch-32bit-triplet>/libnvidia-glcore... exists; if so, also symlink it at the proper location
    • when symlinking, the prefix path is preserved, eg.
      /usr/lib/<arch-triplet>/libnvidia-tls.so..  -> /var/lib/snapd/lib/gl/<arch-triplet>/libnvidia-tls.so..
      /usr/lib/<arch-triplet>/tls/libnvidia-tls.so..  -> /var/lib/snapd/lib/gl/<arch-triplet>/tls/libnvidia-tls.so..
      
    • if the canary lib does not exist, fall back to bind mounting /usr/lib/nvidia-<driver-major> as we did before
  • the apparmor profile was updated to allow snap-confine to create the prefix path as needed
  • the opengl interface was updated to allow access to /var/lib/snapd/lib/gl/<arch-triplet>/tls/libnvidia-tls.so.., as this was causing unexplained segfaults that were already debugged in the other nvidia thread
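The runtime steps above can be sketched in shell against a fake root. The x86_64-linux-gnu triplet and the 390.42 version are example assumptions borrowed from this post; the real implementation lives in snap-confine’s C code:

```shell
# Build a fake root mimicking a bionic system with nvidia 390.42.
ROOT=$(mktemp -d)
ARCH=x86_64-linux-gnu
mkdir -p "$ROOT/usr/lib/$ARCH/tls" "$ROOT/sys/module/nvidia"
echo "390.42" > "$ROOT/sys/module/nvidia/version"
touch "$ROOT/usr/lib/$ARCH/libnvidia-glcore.so.390.42"
touch "$ROOT/usr/lib/$ARCH/tls/libnvidia-tls.so.390.42"

# Step 1: grab the current driver version by poking /sys.
ver=$(cat "$ROOT/sys/module/nvidia/version")

# Step 2: check whether the canary library exists in the triplet dir.
canary="$ROOT/usr/lib/$ARCH/libnvidia-glcore.so.$ver"
if [ -e "$canary" ]; then
    # Step 3: symlink every nvidia library into the snapd gl dir,
    # preserving the tls/ prefix path.
    mkdir -p "$ROOT/var/lib/snapd/lib/gl/$ARCH/tls"
    for lib in "$ROOT/usr/lib/$ARCH"/libnvidia-* \
               "$ROOT/usr/lib/$ARCH"/tls/libnvidia-*; do
        rel=${lib#"$ROOT/usr/lib/$ARCH/"}
        ln -s "$lib" "$ROOT/var/lib/snapd/lib/gl/$ARCH/$rel"
    done
fi
ls -R "$ROOT/var/lib/snapd/lib/gl/$ARCH"
```

If the canary were missing, the fallback would be the old bind mount of /usr/lib/nvidia-<driver-major> instead.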

Ohmygiraffe works with nvidia and confinement now.


#15

I’ve tested @mborzecki’s PR on my 16.04 laptop, running three games (ohmygiraffe, minecraft, and supertuxkart), and it works fine with both the nvidia card and the intel card, as long as I discard the mount namespaces of opengl-using snaps when I switch PRIME.

In other words, it works as expected; it’s either not a regression or an outright improvement (I hadn’t realised I needed to discard namespaces when switching, so I haven’t tested that without this patch; before, it looked like things just didn’t work with intel).

The glxinfo in graphics-debug-tools-bboozzoo is still not finding the intel driver, though, which might be a bug in it (or in what we’re doing), but it’s still not a regression.


#16

What are you stating isn’t a regression? You tested on an unaffected system, since the problem occurs on 18.04. Therefore your statement that “it works fine” is incorrect with regard to the bug being discussed here.


#17

Hmm, I guess there is no Vulkan support here…
Arch Linux

➜  ~ graphics-debug-tools-bboozzoo.vulkaninfo
===========
VULKAN INFO
===========

Vulkan API Version: 1.0.61

Cannot create Vulkan instance.
/build/vulkan-YO5iw0/vulkan-1.0.61.1+dfsg1/demos/vulkaninfo.c:704: failed with VK_ERROR_INCOMPATIBLE_DRIVER

#18

Doesn’t he mean the patch for 18.04 doesn’t cause a regression in 16.04?


#19

Maybe. English is only my first language, so I’m likely to have misread :-p Words I do good readie writer :smiley:


#20

I’m aware. FWIW, it’s the same on Arch too, but I haven’t tried to debug it further yet.