Wayland-launch: Irregular "XDG_RUNTIME_DIR" error

I’m using the wayland-launch script from @alan_g for my Qt kiosk daemon on Ubuntu Core. Most of the time it works smoothly, but recently I have started getting an error and a black screen from time to time:

Jun 08 14:16:30 localhost my_snap.daemon[1711]: QStandardPaths: XDG_RUNTIME_DIR points to non-existing path '/run/user/0/snap.my_snap', please create it with 0700 permissions.
Jun 08 14:16:30 localhost my_snap.daemon[1711]: Failed to create display (No such file or directory)
Jun 08 14:16:30 localhost my_snap.daemon[1711]: [nav_gui-9] process has died [pid 12028, exit code 1, cmd /snap/my_snap/83/opt/ros/melodic/lib/nav_gui/nav_gui_node __name:=nav_gui __log:=/root/snap/my_snap/83/ros/log/868732be-a97f-11ea-bc37-24f5a2f237ee/nav_gui-9.log].
Jun 08 14:16:30 localhost my_snap.daemon[1711]: log file: /root/snap/my_snap/83/ros/log/868732be-a97f-11ea-bc37-24f5a2f237ee/nav_gui-9*.log

What confuses me is that the application works fine most of the time and this error only appears irregularly. It seems that it is the snap’s $XDG_RUNTIME_DIR that is missing, not the real one.

I’ve found the posts about missing XDG_RUNTIME_DIR, but I think this is exactly what wayland-launch takes care of.

  daemon:
    command: run-daemon wayland-launch roslaunch nav_launch my_snap_gui.launch
    daemon: simple
    restart-condition: always
    refresh-mode: endure # don't restart daemon during a snap refresh

When do you see the error? When the daemon starts?

When the error occurs, does the directory /run/user/0/snap.my_snap exist? What are the permissions?

In case of the error, the folder /run/user/0/snap.my_snap does not exist.

Exactly. As soon as I log in after the device starts and check journalctl, the error is there. When I restarted the daemon, the folder was created and the GUI started.

I just checked it on another Ubuntu Core device running a different image (amd64 vs. NUC image), and after 3 restarts I ran into the same error.

It sounds as though creating the dir fails for some reason.

When the error occurs, is there anything in dmesg relating to /run/user/0/...?

Indeed:

dmesg | grep "/run/user/0/"
[   19.415247] audit: type=1400 audit(1591630016.336:110): apparmor="DENIED" operation="symlink" profile="snap.navion.daemon" name="/run/user/0/snap.navion" pid=1806 comm="ln" requested_mask="c" denied_mask="c" fsuid=0 ouid=0

That’s the link command failing, but was there an error from the create directory command?

Could you check the few lines above that?

I can’t see anything suspicious from my side. I’ll try to paste my output.

[log shared privately]

I don’t see anything there either. How odd.

All I can suggest is adding some debugging to the script you use to try and illuminate the failure mode. Checking whether /run/user/0 exists, & what its permissions are etc.

Just so I get it right: the real XDG_RUNTIME_DIR already exists, contains the wayland-0 file and is /run/user/0/? Or is it /run/user/1000/? The environment variable points to 1000, but only 0 contains the wayland-0 file and is the one being symlinked.

The snap’s $XDG_RUNTIME_DIR, which points to /run/user/0/snap.my_snap and fails to be created, is the one you’re creating in wayland-launch? By “debugging the script” you mean the wayland-launch script from GitHub?

What you’re proposing is to add some outputs to the wayland-launch script. I assume I should inspect the output with journalctl or write to a file?

A daemon runs as root, so the non-snap XDG_RUNTIME_DIR (/run/user/$(id -u)) is /run/user/0/.

If mir-kiosk (or some alternative compositor) is running, then /run/user/0/ should contain wayland-0.

Correct.

Just add something like find /run/user/0/ to the script. That will check our expectation of the content (and existence) of /run/user/0/.
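As a rough sketch of that kind of debugging, something like the following could go near the top of the launch script. The function name report_runtime_dir is made up for this example and is not part of wayland-launch; it only prints whether the directory exists and, if so, its permissions and contents.

```shell
#!/bin/sh
# Hypothetical debugging helper (not part of wayland-launch).
# Prints whether a directory exists and, if it does, its permissions and contents.
report_runtime_dir() {
    dir="$1"
    if [ -d "$dir" ]; then
        echo "debug: $dir exists"
        ls -la "$dir"    # the "." entry shows the permissions of the directory itself
    else
        echo "debug: $dir does not exist"
    fi
}

# For a daemon running as root, this inspects /run/user/0.
report_runtime_dir "/run/user/$(id -u)"
```

Anything the function prints ends up in the daemon's journal alongside the rest of the script output.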

You can view the output from the script using:

snap logs <your snap>

This takes similar options to tail, so I often use incantations like:

snap logs -n 100 -f my_snap

HTH

I just realized that the mkdir actually throws an error in the log:

Jun 09 14:23:38 localhost my_snap.daemon[1393]: + mkdir -p /run/user/0/snap.my_snap -m 700
Jun 09 14:23:38 localhost my_snap.daemon[1393]: mkdir: cannot create directory ‘/run/user/0’: Permission denied

I tried what you proposed (checking whether the folder exists, and its permissions, before the mkdir). However, I never get the logs from the very beginning, so I added a 4s sleep at the start of wayland-launch. Since then I have powered up the system at least 20 times without any issues with XDG_RUNTIME_DIR. Before, it took 3 to 5 restarts to hit the error. Maybe I’m just lucky now (or unlucky, since I want to see the error again). But could the whole issue be about timing? For example, the script in the daemon runs too early, before the daemon has been granted permission on its runtime folder? That would explain the variability of the outcome.

OK. So it’s not creating /run/user/0/snap.my_snap that fails, but instead, that /run/user/0 doesn’t exist. (And this snap doesn’t have the permissions to create it - whereas mir-kiosk does.)

So, it looks like we should move the mkdir until after the socket exists. Let’s try something like:

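#!/bin/sh

# Trace each command so it shows up in the snap logs
set -x

# The compositor's socket lives in the parent of the snap's XDG_RUNTIME_DIR
real_wayland=$(dirname "$XDG_RUNTIME_DIR")/${WAYLAND_DISPLAY:-wayland-0}
# Wait until the socket exists and is owned by this user
while [ ! -O "${real_wayland}" ]; do echo waiting for Wayland socket; sleep 4; done

# Only now create the snap's runtime dir and link the socket into it
mkdir -p -m 700 "$XDG_RUNTIME_DIR"
ln -sf "${real_wayland}" "$XDG_RUNTIME_DIR"
unset DISPLAY

exec "$@"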

This seems to work. Tried to power up the system 20 to 30 times without the error. Thank you!

Are you going to change this in the public repo?

If I understand correctly, mir-kiosk is responsible for creating the /run/user/0 folder that Wayland uses and for putting the wayland-0 file there. So the problem is that my daemon sometimes starts before mir-kiosk is ready?

I agree, that explains what you’ve been seeing and proves we have a solution.

Thanks for reporting the problem and your persistence in helping find a solution. I’ll update the mir-kiosk-snap-launch repo.
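One possible variation on the wait loop, shown only as a sketch: the loop in the script above polls forever, so if the compositor never comes up the daemon just sits there. Assuming you would rather have the daemon exit and let restart-condition: always retry it, the wait could be bounded. The helper name wait_for_socket and the timeout argument are hypothetical, not something wayland-launch provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of wayland-launch):
#   wait_for_socket PATH TIMEOUT_SECONDS
# Polls once per second until PATH exists and is owned by the current user.
# Returns 0 once it does, or 1 if TIMEOUT_SECONDS pass first.
wait_for_socket() {
    socket="$1"
    timeout="$2"
    waited=0
    while [ ! -O "$socket" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up waiting for $socket after ${waited}s" >&2
            return 1
        fi
        echo "waiting for Wayland socket"
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In the launch script, the while line would then become something like wait_for_socket "${real_wayland}" 60 || exit 1, where 60 seconds is an arbitrary choice.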

Thank you for helping to fix this so quickly. The test users of our embedded system just started to complain about the black screen.