SIGSEGV / segmentation fault only on armhf version of wpe-webkit-mir-kiosk

In the latest version of my snap WPE WebKit Mir Kiosk (2.30.2), I’m facing a segfault at program startup during Wayland initialization. This happens only on the armhf build; the amd64 build from the identical yaml works fine on a Core PC installation (gadget pc 18-2 r104, pc-kernel 4.15.0-122.124 r625). So I guess it’s either a subtle problem during compilation or something else isolated to that target architecture.

I pushed that faulty build (rev 51) to the edge channel so that others may reproduce the error, but please note that it will segfault right at service startup. If you have an amd64 Core device with mir-kiosk running, rev 50 on amd64/edge should work fine.

I’m neither super-comfortable with C/C++ nor its debugging, so I hope that someone here can point me in the right direction to get this running.

Build

  1. Built with snapcraft 4.3 from this snapcraft.yaml natively on a Raspberry Pi 4 w/ 4GB RAM + 4GB swap, vanilla core18 gadget image, inside an LXD container according to @ogra’s tutorial.
  2. As you can see in the snapcraft.yaml, I use gcc8/g++8 as bionic’s default gcc7 threw a compiler error; WPE maintainers advised to use gcc8.

Debugging results so far

environment

Test installations on Raspberry Pis, one model 4 and one 3B, both running core18 images with mir-kiosk 2.1.0-snap103 (latest stable).

Debugging flags: G_MESSAGES_DEBUG=all LIBGL_DEBUG=verbose WAYLAND_DEBUG=1

user@pi:~$ snap version
snap    2.47.1
snapd   2.47.1
series  16
kernel  5.3.0-1036-raspi2

service log

2020-11-02T16:30:50Z systemd[1]: Started Service for snap application wpe-webkit-mir-kiosk.browser.
2020-11-02T16:30:52Z -[10502]: platform_setup: Platform name: fdo
2020-11-02T16:30:52Z -[10502]: platform_setup: Platform plugin: libcogplatform-fdo.so
2020-11-02T16:30:52Z -[10502]: Initializing Wayland...
2020-11-02T16:30:52Z wpe-webkit-mir-kiosk.browser[10323]: [1158415.802]  -> wl_display@1.get_registry(new id wl_registry@2)
2020-11-02T16:30:52Z wpe-webkit-mir-kiosk.browser[10323]: [1158415.974]  -> wl_display@1.sync(new id wl_callback@3)
2020-11-02T16:30:52Z wpe-webkit-mir-kiosk.browser[10323]: [1158416.262] wl_display@1.delete_id(3)
2020-11-02T16:30:52Z wpe-webkit-mir-kiosk.browser[10323]: /snap/wpe-webkit-mir-kiosk/51/bin/launch-wpe: line 29: 10502 Segmentation fault      "$SNAP"/usr/bin/cog -P fdo --bg-color=black --enable-mediasource=1 --webprocess-failure=restart --enable-write-console-messages-to-stdout="$error_to_console" "$url"
2020-11-02T16:30:52Z systemd[1]: snap.wpe-webkit-mir-kiosk.browser.service: Main process exited, code=exited, status=139/n/a
2020-11-02T16:30:52Z systemd[1]: snap.wpe-webkit-mir-kiosk.browser.service: Failed with result 'exit-code'.
2020-11-02T16:30:52Z systemd[1]: snap.wpe-webkit-mir-kiosk.browser.service: Service hold-off time over, scheduling restart.
2020-11-02T16:30:52Z systemd[1]: snap.wpe-webkit-mir-kiosk.browser.service: Scheduled restart job, restart counter is at 15.
2020-11-02T16:30:52Z systemd[1]: Stopped Service for snap application wpe-webkit-mir-kiosk.browser.
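
A side note on status=139 in the log above: a shell reports a child killed by a signal as 128 plus the signal number, so 139 points to signal 11, which is SIGSEGV on Linux. A minimal Python sketch of that arithmetic follows; the helper name describe_exit_status is made up for illustration, and a program could in principle also exit with code 139 on purpose.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited normally with code {status}"


print(describe_exit_status(139))
```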

snappy-debug

Running sudo journalctl --output=short --follow --all | sudo snappy-debug in one terminal brings up nothing while sudo snap run wpe-webkit-mir-kiosk.browser runs in a second terminal. Also, I would expect confinement issues to appear in the amd64 version as well.

strace

sudo snap run --strace wpe-webkit-mir-kiosk.browser (excerpt starting from the message “Initializing Wayland”, which indicates that things work up to this point; full strace here since it’s > 17MB)

(cog:4119): Cog-FDO-DEBUG: 10:06:58.080: Initializing Wayland...
[pid  4119] write(1, "(cog:4119): Cog-FDO-\33[1;32mDEBUG"..., 85) = 85
[pid  4119] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 11
[pid  4119] connect(11, {sa_family=AF_UNIX, sun_path="/run/user/0/snap.wpe-webkit-mir-kiosk/wayland-0"}, 50) = 0
[pid  4119] write(2, "[1295381.194]  -> wl_display@1.g"..., 44[1295381.194]  -> wl_display@1.get_registry() = 44
[pid  4119] write(2, "new id wl_registry@", 19new id wl_registry@) = 19
[pid  4119] write(2, "2", 12)            = 1
[pid  4119] write(2, ")\n", 2)
)          = 2
[pid  4119] write(2, "[1295382.985]  -> wl_display@1.s"..., 36[1295382.985]  -> wl_display@1.sync() = 36
[pid  4119] write(2, "new id wl_callback@", 19new id wl_callback@) = 19
[pid  4119] write(2, "3", 13)            = 1
[pid  4119] write(2, ")\n", 2)
)          = 2
[pid  4119] sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1\0\0\0\1\0\f\0\2\0\0\0\1\0\0\0\0\0\f\0\3\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 24
[pid  4119] poll([{fd=11, events=POLLIN}], 1, -1) = 1 ([{fd=11, revents=POLLIN}])
[pid  4119] recvmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\34\0\1\0\0\0\7\0\0\0wl_drm\0\0\2\0\0\0\2\0\0\0"..., iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 400
[pid  4119] write(2, "[1295385.844] wl_display@1.delet"..., 37[1295385.844] wl_display@1.delete_id() = 37
[pid  4119] write(2, "3", 13)            = 1
[pid  4119] write(2, ")\n", 2)
)          = 2
[pid  4119] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x5} ---
[pid  4130] <... poll resumed> <unfinished ...>) = ?
[pid  4129] <... futex resumed>)        = ?
[pid  4128] <... poll resumed> <unfinished ...>) = ?
[pid  4127] <... poll resumed> <unfinished ...>) = ?
[pid  4126] <... poll resumed> <unfinished ...>) = ?
[pid  4125] <... futex resumed>)        = ?
[pid  4124] <... futex resumed>)        = ?
[pid  4129] +++ killed by SIGSEGV +++
[pid  4130] +++ killed by SIGSEGV +++
[pid  4128] +++ killed by SIGSEGV +++
[pid  4127] +++ killed by SIGSEGV +++
[pid  4126] +++ killed by SIGSEGV +++
[pid  4125] +++ killed by SIGSEGV +++
[pid  4124] +++ killed by SIGSEGV +++
[pid  4119] +++ killed by SIGSEGV +++
<... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0, NULL) = 4119
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0xb6e5e751}, {sa_handler=0x470705, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0xb6e5e751}, 8) = 0
openat(AT_FDCWD, "/usr/share/locale/C.UTF-8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/C.utf8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/C/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/C.UTF-8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/C.utf8/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/C/LC_MESSAGES/bash.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
fstat64(2, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
openat(AT_FDCWD, "/usr/share/locale/C.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/C.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/C/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/C.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/C.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/C/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "/snap/wpe-webkit-mir-kiosk/51/bi"..., 250/snap/wpe-webkit-mir-kiosk/51/bin/launch-wpe: line 29:  4119 Segmentation fault      "$SNAP"/usr/bin/cog -P fdo --bg-color=black --enable-mediasource=1 --webprocess-failure=restart --enable-write-console-messages-to-stdout="$error_to_console" "$url"
) = 250
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=4119, si_uid=0, si_status=SIGSEGV, si_utime=10, si_stime=41} ---
wait4(-1, 0xbe9d2bfc, WNOHANG, NULL)    = -1 ECHILD (No child processes)
sigreturn({mask=[]})                    = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(139)                         = ?
+++ exited with 139 +++
error: exit status 139
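
To double-check what the client actually sent before the crash, the sendmsg() payload from the strace can be decoded by hand. The sketch below assumes the standard Wayland wire format (a 32-bit object id, then a 32-bit word with the message size in the upper 16 bits and the opcode in the lower 16 bits) and a little-endian host, which armhf is. The function name split_messages is made up for illustration.

```python
import struct

# Payload of the sendmsg() call in the strace above, copied byte for byte
# (strace prints \f for 0x0c).
payload = (
    b"\x01\x00\x00\x00\x01\x00\x0c\x00\x02\x00\x00\x00"
    b"\x01\x00\x00\x00\x00\x00\x0c\x00\x03\x00\x00\x00"
)


def split_messages(data: bytes):
    """Yield (object_id, opcode, size, body) for each message in data."""
    offset = 0
    while offset + 8 <= len(data):
        object_id, word = struct.unpack_from("<II", data, offset)
        size, opcode = word >> 16, word & 0xFFFF
        if size < 8:
            break  # malformed header, stop instead of looping forever
        yield object_id, opcode, size, data[offset + 8:offset + size]
        offset += size


for object_id, opcode, size, body in split_messages(payload):
    # Both requests here carry a single 32-bit new_id argument.
    (new_id,) = struct.unpack("<I", body)
    print(f"object {object_id} opcode {opcode} size {size} new_id {new_id}")
```

On wl_display, request opcode 1 is get_registry and opcode 0 is sync, so the two decoded messages match the WAYLAND_DEBUG lines (registry as id 2, callback as id 3). That suggests the request side of the handshake is fine and the crash happens while handling the reply.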

Note that the “no such file or directory” errors for bash.mo and libc.mo at the end appear after the segfault, probably from trying to localize the error message. Over in SIGSEGV (Address boundary error) - #5 by mmartinortiz, the cause was just a missing library, but I don’t see any library lookup errors in the strace right before the crash.
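
One more detail from the strace: the signal line says si_code=SEGV_MAPERR with si_addr=0x5. SEGV_MAPERR means the address is not mapped at all, and an address this close to zero usually hints at a NULL (or otherwise tiny) pointer being dereferenced with a small offset, rather than at a missing library. A rough Python sketch for pulling these fields out of a large strace log; parse_segv is a made-up helper and the 4096-byte page size is an assumption.

```python
import re

# Matches the siginfo part of a strace SIGSEGV line.
SIGSEGV_RE = re.compile(
    r"SIGSEGV \{.*?si_code=(\w+), si_addr=(0x[0-9a-fA-F]+|NULL)\}"
)


def parse_segv(line):
    """Return (si_code, address) from a strace SIGSEGV line, or None."""
    match = SIGSEGV_RE.search(line)
    if match is None:
        return None
    code, addr = match.group(1), match.group(2)
    return code, 0 if addr == "NULL" else int(addr, 16)


line = "[pid  4119] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x5} ---"
code, address = parse_segv(line)

# Assumption: 4096-byte pages; anything below that lies in the page at address 0.
near_null = address < 4096
print(code, hex(address), "near NULL" if near_null else "not near NULL")
```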

I also tried to probe it with gdb, but it’s a Release build, and the build with debug symbols is still running :upside_down_face: As this error only occurs on armhf, I used the --experimental-gdb-server variant described in the docs, with gdb-multiarch on an Ubuntu 20.04 amd64 machine. Works and connects, but without debug symbols it’s not useful.

Maybe @ogra, @alan_g or anyone else has a suggestion here? :unamused:

one thing that sticks out is that the snapcraft-preload loads lib32stdc++ for amd64 but for all other arches it loads libstdc++ …

@ogra How could I find out if that’s the culprit? I ran the snap service with strace on an amd64 installation, but that doesn’t contain lib32 – which is the grep-able difference I guessed from the package contents of lib32stdc++6 on amd64 and libstdc++6 on armhf :upside_down_face:
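
One way to compare the two architectures would be to list which libstdc++ files each strace shows as successfully opened. This is only a sketch: the sample lines below are invented for illustration (they are not from my logs), and loaded_libraries is a hypothetical helper name.

```python
import re

# Matches open()/openat() calls in strace output and captures path and result.
OPEN_RE = re.compile(
    r'open(?:at)?\((?:AT_FDCWD, )?"([^"]+)"[^)]*\)\s*=\s*(-?\d+)'
)


def loaded_libraries(strace_lines, needle="libstdc++"):
    """Return paths containing needle that were opened successfully."""
    paths = []
    for line in strace_lines:
        match = OPEN_RE.search(line)
        if not match:
            continue
        path, result = match.group(1), int(match.group(2))
        # A negative result (e.g. -1 ENOENT) means the open failed.
        if needle in path and result >= 0 and path not in paths:
            paths.append(path)
    return paths


# Invented sample lines, only to show the expected input shape.
sample = [
    '[pid 1] openat(AT_FDCWD, "/example/lib/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3',
    '[pid 1] openat(AT_FDCWD, "/example/other/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)',
    '[pid 1] openat(AT_FDCWD, "/example/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 4',
]
print(loaded_libraries(sample))
```

Running this over the full strace from each architecture and diffing the two lists should show whether a different libstdc++ gets picked up.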

Two other differences I noticed in the build logs:

  1. During the build on armhf, gcc reports lots of cast-align warnings. That is not the case on amd64 builds (snapcraft remote-build log for amd64, don’t have one for armhf as I only have it locally). I know little enough about these topics to give it a solid “maaaaaaybe” … :sweat_smile:
  2. The amd64 build actually ignores the cmake parameter CMAKE_CXX_COMPILER=g++-8 (line 8089 in the remote-build log), whereas the armhf build doesn’t show this warning. Is it feasible to build a project with gcc8, but g++7? Distinguished readers might already have noticed that I only have a faint idea of the fundamentals here.

EDIT: I was looking at the wrong part – CXX_COMPILER is only ignored for the cog part, have to wait for the current armhf build to check if that’s also the case there. For the wpe-webkit part, gcc8/g++8 are used on both arches.

by simply adding a second “on armhf” to the thing that installs lib32stdc++ and doing a test build ?

Sorry, should’ve phrased that more clearly: I don’t understand where lib32stdc++ comes from; it is neither listed as a direct build/stage package, nor is there any package on the amd64 LXD build container that depends on it (checked with apt-cache --installed rdepends lib32stdc++). Apart from that, lib32stdc++6 is only available on amd64/s390x; other arches only have cross-compilation packages – or do I simply not understand this enough?

To be honest, I was confused by your mention of snapcraft-preload. I thought that’s the name for this helper by @sergiusens to work around hardcoded paths, but I dropped that for the 2.30.2 version since it’s no longer needed. Is there a snapcraft-internal thing by the same name? Sorry if it’s obvious, but I can’t make sense of this.

I’m all in for the “trial and error” approach, but I thought I’d ask in parallel while the builder is running >> 4 hours … (and aborted with EOF once again overnight).

Again, apologies if I’m asking obvious questions :upside_down_face:

uh, oh, ignore me i just noticed i looked at your old snapcraft.yaml, not at the most recent one … sorry for de-railing your focus with that…

Hey @tobias any news on this?

I was trying to build this using an LXD container but faced some problems with setting up LXD.

Hey @ogra, I’m trying to build wpe-webkit-mir-kiosk for a Raspberry Pi 3. I’m getting the following error at

sudo lxd init --auto

Error: Failed to create network "lxdbr0" in project "default": Failed to check dnsmasq version: Failed to run: dnsmasq --version: dnsmasq: relocation error: dnsmasq: symbol nettle_lookup_hash version NETTLE_6 not defined in file libnettle.so.6 with link time reference

https://github.com/lxc/lxd/issues/7207 suggests that this could be a kernel issue. I’m using Raspberry Pi 3 Model B Rev 1.2 with Ubuntu Core 18

lxd.check-kernel returns the following

/snap/lxd/18523/bin/lxc-checkconfig: 55: /snap/lxd/18523/bin/lxc-checkconfig: lxc-start: not found
LXC version
Kernel configuration not found at /proc/config.gz; searching...
lxc-checkconfig: unable to retrieve kernel configuration

Try modprobe configs module, or Try recompiling with IKCONFIG_PROC, installing the kernel headers, or specifying the kernel configuration path with: CONFIG= lxc-checkconfig

to check on a Core system you need to use the CONFIG= variable like (as suggested in your output above):

CONFIG=/snap/pi-kernel/current/config-5.3.0-1036-raspi2 lxd.check-kernel

the default kernels for the pi UbuntuCore 18 images should actually have everything enabled you need (it definitely works fine on all my UC18 installs over here using the official images and i think it is also part of the lxd snap release process to run tests on that environment)

The issue was:

CONFIG_NF_NAT_IPV4: missing
CONFIG_NF_NAT_IPV6: missing

a fresh core 18 install fixed the issue. Thanks!
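
For anyone who wants to check a kernel config file for specific options without lxd.check-kernel, here is a small Python sketch. It assumes the usual config format (CONFIG_FOO=y or =m when enabled, a "# CONFIG_FOO is not set" comment otherwise); the sample text is invented and missing_options is a hypothetical helper name.

```python
def missing_options(config_text, required):
    """Return the options from required that are not set to y or m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [option for option in required if option not in enabled]


# Invented sample, only to show the input format.
sample = """\
CONFIG_NF_NAT=m
# CONFIG_NF_NAT_IPV4 is not set
CONFIG_BRIDGE=m
"""

print(missing_options(
    sample,
    ["CONFIG_NF_NAT_IPV4", "CONFIG_NF_NAT_IPV6", "CONFIG_BRIDGE"],
))
```

On a real system the text would come from the config file inside the kernel snap, as in the CONFIG= example earlier in the thread.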


@kanishkasw Not yet, I had to shift priorities and didn’t have time to investigate further. What RPi are you using for the build? I build on a 4B with 4GB RAM, additional 8GB of swap on a 64GB USB3 stick and still sometimes get “unexpected EOF” or memory issues during a build. Next step is using an SSD.
Also, the build takes 6+ hours. I registered for Equinix Metal as they have bare metal ARM servers, but until they rolled out gen 3, their current “legacy” gen 2 is reserved for long-time customers.

Testing distcc as a way to speed up builds is on the todo list as well.