Changing sys kernel settings relating to crash reboots on UC20

jocado · September 28, 2020, 6:28pm

I’m interested in what’s possible regarding changes to boot behaviour under certain “crash” conditions.

I read here that dropping in sysctl config files is not good practice and not very easy on UC20, and that access to some network sysctl settings is allowed via the network-control interface.

Is there a similar interface that can allow changes to anything under /proc/sys/kernel/ or /proc/sys/vm ?

Of particular interest are things like:

kernel.panic
kernel.panic_on_oops
kernel.softlockup_panic
vm.panic_on_oom
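
For reference, each of these dotted sysctl names corresponds to a file under /proc/sys, so the current values can at least be inspected from userspace. A minimal Python sketch, assuming a Linux system; the helper names sysctl_path and read_sysctl are made up for illustration, and whether a confined snap may read or write these files depends on the interfaces it has connected:

```python
from pathlib import Path

# The four settings named above, in sysctl dotted notation.
PANIC_SYSCTLS = [
    "kernel.panic",
    "kernel.panic_on_oops",
    "kernel.softlockup_panic",
    "vm.panic_on_oom",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file, e.g. kernel.panic -> /proc/sys/kernel/panic."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if the file is missing or unreadable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in PANIC_SYSCTLS:
        print(f"{name} = {read_sysctl(name)}")
```

This only reads the values; writing them requires root and, under snap confinement, an interface that permits it, which is the open question here.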

Ignoring the fact that if something is really badly broken it may end up in a reboot cycle, and the visibility of such conditions, the aim is to end up with less manual intervention required for such issues.

Being able to set these in some combination could be very useful.

Thanks!

Cheers,
Just

ijohnson · September 28, 2020, 6:54pm

The system-trace interface allows for

  /sys/kernel/debug/tracing/** rw,

which is probably not what you are looking for, but is the closest I could find looking through the interfaces we currently have. Additionally, it’s unclear to me whether allowing a snap to set these kinds of settings would be useful to you: if your kernel is dying, then a userspace snap will probably not get a chance to do anything useful.

In fact this is probably one of the cases where snapd should export more system config that can be set using defaults from the gadget.yaml. We do have the system.kernel.printk.console-loglevel config item, which may be helpful to you. If not, it is at least very close to the kinds of settings you want to set, since it writes to sysctl, so it would be easy to propose a PR which adds the items you want.

Another option of course, since you are debugging a kernel problem, is that you could try to build a debug kernel with the settings you are interested in setting enabled via the kernel config.

jocado · September 28, 2020, 7:07pm

Yes, this is what I’m thinking really. It’s stuff we want to configure for an image, but as we can’t drop in sysctl config, we need to configure it from somewhere. If it was supported system config it would be awesome.

For reference, this isn’t so much about debugging issues, but ensuring that if we do hit the odd kernel bug or hardware “blip” [ as sometimes happens more frequently in uncontrolled environments ] the device just defaults to rebooting itself and starting again cleanly.

@ijohnson “New customer question” By PR do you mean Product Request, or Pull Request, or something else ? If Product Request, how would we raise that ?

Cheers, Just

ijohnson · September 28, 2020, 7:27pm

Sorry I meant a pull request to github.com/snapcore/snapd but certainly feel free to also just file a bug at bugs.launchpad.net/snapd and we will triage it.

ogra · September 28, 2020, 9:56pm

panic is set on the kernel commandline by the gadget and must be set to -1 in production to make sure the kernel rollback mechanism works … during development you can indeed set it how you like … i’d guess the other two kernel.* options have cmdline equivalents too that you can set in your gadget …
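
To check what a running device was actually booted with, the kernel command line can be read from /proc/cmdline and inspected. A minimal Python sketch; the function names and the sample command line are made up for illustration, and the parser deliberately ignores quoted values containing spaces:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict.

    Bare flags (no '=') map to None. Quoted values with spaces are
    not handled; this is only meant for simple key=value parameters.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def reboots_immediately_on_panic(cmdline):
    """True if panic=-1 is present, the value the post above says production needs."""
    return parse_cmdline(cmdline).get("panic") == "-1"


if __name__ == "__main__":
    # Hypothetical sample; on a real device read the text of /proc/cmdline instead.
    sample = "console=ttyS0 panic=-1 quiet"
    print(parse_cmdline(sample))
    print(reboots_immediately_on_panic(sample))
```

As far as I know the upstream kernel parameters oops=panic and softlockup_panic= cover the other two kernel.* settings, but treat that as something to verify against the kernel-parameters documentation for the kernel version in your image.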

jocado · September 29, 2020, 8:27am

Thanks @ogra

panic=-1 is what we would prefer in fact, but thanks for pointing out that the rollback feature relies on it too

Good point on linux kernel params. What’s the best practice for setting those via gadget ?

We can just alter the cmdline= in grub.conf at least I guess.

For recovery, does it boot using grub-recovery.conf then reboot immediately afterwards using standard grub.conf , or will it remain running after booting the recovery mechanism ? Any good docs on the recovery mechanism for UC20 ?

Cheers, Just