Error creating key

cachio · June 10, 2017, 3:12am

I was researching an issue with the snap create-key command: it gets stuck and the test times out. For example, this test failed: https://travis-ci.org/snapcore/snapd/builds/241031225

I could reproduce it sporadically and found in the logs that the kernel reports a segfault. I am investigating further why this segfault happens.

Here is all the data I collected from the machine.

The command to reproduce it: https://paste.ubuntu.com/24819863/ (I used just one worker)
syslog: https://paste.ubuntu.com/24820119/
dmesg: https://paste.ubuntu.com/24819869/
kernel.log: https://paste.ubuntu.com/24819881/
systemd status: https://paste.ubuntu.com/24819949/
test output: https://paste.ubuntu.com/24819856/
I also saved the snapd state: http://people.canonical.com/~sjcazzol/snappy/snapd-state-create-key.tar.gz

Any help is welcome.

Thanks

cachio · June 12, 2017, 4:13am

I manually tried to create a key and it gets stuck. These are the processes running on the machine:
root 18725 18554 0 03:48 pts/1 00:00:00 snap create-key testcachio
root 18736 18725 0 03:48 pts/1 00:00:00 /usr/bin/gpg --homedir /root/.snap/gnupg -q --no-auto-check-trustdb --batch --gen-key

Then I set up rngd and called gpg, and I could create a key. After that, the snap create-key command started working again: https://paste.ubuntu.com/24838922/ . This seems to be a problem with how random data generation is set up in the tests.

fgimenez · June 12, 2017, 6:57am

Hi Sergio, thanks for looking into this! There are some fixes in place to mitigate the lack of entropy you mentioned (see Snap create-key timeouts), and we also use pollinate https://github.com/dustinkirkland/pollinate to seed the generator. With this in place the frequency of the timeout is much lower, but we are still being hit by it.

We have also added some debug output that prints the available entropy when the test fails. The log you posted above https://travis-ci.org/snapcore/snapd/builds/241031225#L3384 shows

kernel.random.entropy_avail = 32

which is too low, despite all our efforts. Another interesting data point: after pointing the generator to /dev/urandom and seeding it, the timeout seems to happen only on ubuntu-16.04-32. Your log confirms this, but it may be worth trying to reproduce the problem on other systems to be extra sure. @pedronis suggested excluding this system from testing until we understand the root cause of the problem.
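As a rough illustration of the entropy check described above, here is a minimal sketch. The threshold of 200 and the names MIN_ENTROPY and entropy_is_low are assumptions made for this example; they are not taken from the snapd test suite.

```shell
#!/bin/sh
# Sketch: report whether the kernel entropy pool looks low.
# MIN_ENTROPY is an assumed threshold, not a value used by the snapd tests.
MIN_ENTROPY=${MIN_ENTROPY:-200}

entropy_is_low() {
    # $1: an entropy value, e.g. the contents of
    # /proc/sys/kernel/random/entropy_avail
    [ "$1" -lt "$MIN_ENTROPY" ]
}

# Print the current state when the kernel exposes the counter.
if [ -r /proc/sys/kernel/random/entropy_avail ]; then
    avail=$(cat /proc/sys/kernel/random/entropy_avail)
    if entropy_is_low "$avail"; then
        echo "entropy low: $avail"
    else
        echo "entropy ok: $avail"
    fi
fi
```

With the assumed threshold, the value 32 from the failing log would be reported as low.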

ogra · June 12, 2017, 8:38am

do you have access to manipulate the kernel cmdline? if so, putting "rng_core.default_quality=700" on there should help a lot (it will force the in-kernel rng to properly push the entropy up)
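To verify whether that parameter is actually present on a running system, something like the following sketch could be used. The helper name has_kernel_param is hypothetical and exists only for this example.

```shell
#!/bin/sh
# Sketch: check whether a kernel command line contains a given parameter.
# has_kernel_param is a hypothetical helper, not an existing tool.
has_kernel_param() {
    # $1: kernel command line string, $2: parameter name (without "=value")
    case " $1 " in
        *" $2="*|*" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A typical call would be `has_kernel_param "$(cat /proc/cmdline)" rng_core.default_quality`. How the parameter is made persistent depends on the bootloader; on a GRUB-based Ubuntu system it would usually be added to GRUB_CMDLINE_LINUX in /etc/default/grub followed by update-grub, but that is an assumption about the test machines.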

cachio · June 12, 2017, 1:31pm

Hi, we are already doing this before running the tests:

apt-get install -y -q rng-tools
echo "HRNGDEVICE=/dev/urandom" > /etc/default/rng-tools
/etc/init.d/rng-tools restart

@ogra, do you know if the segfault that appears in the logs could be preventing entropy from being generated correctly, and whether that is why we are getting what @fgimenez pointed out?

 kernel.random.entropy_avail = 32

I’ll see if I can add "rng_core.default_quality=700"; I am not sure if it is possible.

This is weird because the problem appears in just 5% of the executions.

ogra · June 12, 2017, 1:35pm

yes, that would be a possible explanation (and also one of the reasons why we do not ship userspace rng tools in ubuntu core but instead force the in-kernel rng mentioned above to actually provide proper entropy)

fgimenez · June 12, 2017, 2:13pm

Even more so, taking into account that the segfault comes from rngd:

Jun 10 01:39:19 ubuntu kernel: [  911.509447] rngd[3001]: segfault at 805f000 ip 0804ac3e sp bfaaa4dc error 6 in rngd[8048000+5000]
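For anyone wanting to spot this in their own logs, a small sketch that matches kernel log lines of the shape shown above. The helper name is_rngd_segfault is hypothetical, and the pattern is only derived from this single log line.

```shell
#!/bin/sh
# Sketch: detect an rngd segfault report in a kernel log line.
# is_rngd_segfault is a hypothetical helper; the pattern is based on the
# one log line quoted in this thread.
is_rngd_segfault() {
    # $1: a single line from dmesg or kern.log
    printf '%s\n' "$1" | grep -Eq 'rngd\[[0-9]+\]: segfault'
}
```

The same pattern can be applied to the live log with `dmesg | grep -E 'rngd\[[0-9]+\]: segfault'`.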

cachio · June 13, 2017, 3:07am

I have added a PR to address this issue.
https://github.com/snapcore/snapd/pull/3473

niemeyer · June 13, 2017, 12:59pm

@cachio As mentioned there, I’m merging it as it’s a step forward, but can’t we do this by default in the project prepare for all cases? There’s no reason for us to want real entropy for anything generated in those tests, as no artifacts are used. We can only ever get blocked by the usual semantics.

cachio · June 14, 2017, 2:22am

This is a new, simple implementation that restarts rng-tools in case of a crash.

https://github.com/snapcore/snapd/pull/3477
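The general idea of restarting the daemon after a crash can be sketched as below. This is not the code from the PR; ensure_running is a hypothetical helper, and the use of pgrep to detect the process is an assumption.

```shell
#!/bin/sh
# Sketch: run a restart command only if a named process is not running.
# ensure_running is a hypothetical helper, not the implementation in the PR.
ensure_running() {
    # $1: exact process name; remaining arguments: command used to restart it
    name=$1
    shift
    if pgrep -x "$name" >/dev/null 2>&1; then
        return 0
    fi
    "$@"
}
```

In the setup discussed in this thread it could be called as `ensure_running rngd /etc/init.d/rng-tools restart` before running snap create-key.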