Hardware watchdog on Ubuntu Core

vpetersson · May 30, 2018, 6:01pm

I was looking into the use of the watchdog device today on the Raspberry Pi version. What I found is that the kernel module does appear to be loaded:

$ lsmod | grep wdt
bcm2835_wdt            16384  0

While I couldn’t find a daemon or configuration in systemd, I was able to get this:

$ sudo wdctl
wdctl: write failed: Invalid argument
Device:        /dev/watchdog
Identity:      Broadcom BCM2835 Watchdog timer [version 0]
Timeout:       15 seconds
Timeleft:      14 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          0           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0

I’m not sure how this is configured, and I’d like to make sure that it is properly configured such that if the system freezes, it will get rebooted properly.
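As a complement to wdctl, the same values can usually be read straight from sysfs. The following is only a sketch: it assumes the kernel was built with sysfs support for watchdog devices (CONFIG_WATCHDOG_SYSFS) and that the device shows up as watchdog0. The helper name show_watchdog is made up for illustration.

```shell
# Sketch: print watchdog attributes exposed by the kernel in sysfs.
# Assumption: /sys/class/watchdog/watchdog0 exists (needs CONFIG_WATCHDOG_SYSFS).
# show_watchdog is a hypothetical helper name, not an existing tool.
show_watchdog() {
    wd_dir="${1:-/sys/class/watchdog/watchdog0}"
    for attr in identity state timeout timeleft nowayout; do
        # Not every driver provides every attribute, so skip missing ones.
        if [ -r "$wd_dir/$attr" ]; then
            printf '%-10s %s\n' "$attr:" "$(cat "$wd_dir/$attr")"
        fi
    done
}
```

Calling show_watchdog without arguments reads the default path. A state of "inactive" would suggest that nothing in userspace has opened the device yet.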

ogra · May 31, 2018, 10:40am

systemd has a builtin watchdog that is sadly not configurable currently. to enable it we’d have to make /etc/systemd/system.conf writable and allow the setting of:

RuntimeWatchdogSec=

… which defines the number of seconds after which the hardware watchdog auto-reboots the system when no ping from systemd comes in …

ShutdownWatchdogSec=

… allows managing a forced system reset from hardware in case a reboot hangs.

defining these watchdog values should probably become a core/system config option, @mvo, @niemeyer … comments ?

vpetersson · May 31, 2018, 11:02am

Got it. My preference would be that it would be populated with sensible default values (but configurable) for each enabled platform.

ogra · May 31, 2018, 11:09am

i think the default should be the current default (off), else you will have the images explode on hardware with buggy watchdog implementation etc.

it should be easily configurable via gadget.yaml (which turning it into a core/system option allows) and the reference platforms should have it set in there by default.

vpetersson · May 31, 2018, 11:20am

Oh, I meant only for “enabled hardware” such as the Raspberry Pi, where you know the exact hardware components.

vpetersson · June 6, 2018, 7:41pm

@ogra will this make it onto the roadmap? If so, when?

zyga-snapd · June 6, 2018, 8:39pm

I would love to add that the eventual watchdog support should be comprehensive enough to be enabled in the boot loader so that snapd’s automatic revert to a working kernel/core snap kicks in on a failed boot without user interaction (power cycle) of the board.

ogra · June 7, 2018, 10:13am

Note that this is not about failed boots but about e.g. single core CPUs with constantly saturated I/O, hardware issues from peripherals, hangs through additional software that misbehaves etc …
i wouldn’t tie the hardware watchdog to forced rollbacks here … we do have bootloader watchdog support enabled by default in our reference images, we also have the software watchdog enabled for services in systemd AFAIK and we have the automated roll-back for failed system boots. the hardware watchdog is not necessarily tied to kernel or rootfs here and i wouldn’t artificially want to add this tie (it could indeed be an optional feature, but if you have a misbehaving service snap, would you really want your board to constantly roll back kernels because I/O is locked down by heavy DB operations or some such?)

zyga-snapd · June 7, 2018, 10:23am

I don’t understand your statements ogra:

there are no databases in the bootloader
AFAIK we don’t have a watchdog enabled in the boot loader on any platform, is this fixed now?
a boot loader that hangs loading a broken kernel will not recover without a watchdog or human interaction

Once Linux has booted and systemd manages the watchdog nothing should fail even under high IO (unless systemd swaps out and the board literally comes down to a halt, at which point a reboot is probably expected but would not cause a rollback anymore).

ogra · June 7, 2018, 10:37am

I didn’t say fail … the hardware watchdog checks for response, not for failures. a single core CPU like you find them in many routers with low RAM and no swap can easily go into stalled IO simply because you can not properly multithread if your CPU does not allow this … a HW watchdog will check if the system still responds and if not it will force-reboot. If you install a postgres DB backed snap service on a beaglebone and the amount of data gets too big for the hardware to stay responsive, the watchdog will take care of it for you. this has nothing to do with kernel or rootfs, it is a limitation of the hardware you use and no roll-back will fix the unresponsiveness.

a remote system that you can not reach to manually reset it needs to be able to auto-reset itself, so that you can still reach it to debug, fix and analyze it over the network. even if there was no kernel oops, OOM or whatnot, you want this system to not eternally stay unresponsive due to saturation.
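To make the "checks for response" point concrete, here is a rough sketch of the keepalive protocol that a userspace process speaks to the watchdog device (systemd would do this itself once RuntimeWatchdogSec= is set). The function name pet_watchdog and its parameters are made up for illustration. It relies on the standard Linux watchdog behaviour where opening the device starts the countdown, each write resets it, and a final 'V' disarms it when the driver supports magic close.

```shell
# Sketch of the userspace side of a hardware watchdog.
# pet_watchdog is a hypothetical helper; on real hardware the first
# argument would be /dev/watchdog and the interval must stay below the
# configured timeout (15 seconds in the wdctl output earlier in the thread).
pet_watchdog() {
    dev="$1"        # device path
    count="$2"      # number of keepalive writes before stopping
    interval="$3"   # seconds to sleep between writes

    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3      # a write to the device resets the countdown
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3          # magic close: ask the driver to disarm on close
    } 3> "$dev"                 # opening the device is what arms the timer
}
```

If the loop stalls for longer than the timeout, for example because the system is saturated, no write arrives and the hardware resets the board. Be careful when trying this on a real device: killing the process before it writes the 'V' leaves the timer armed, and the open will likely fail if another process such as systemd already holds the device.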

There is no db in a bootloader and i was not referring to this at all …

Also, i think @ppisati has put some time into enabling watchdog features in u-boot (though i might admittedly mis-remember this) and as long as the config option is enabled in u-boot it will do exactly this … reboot the device after a kernel hang.

But in any case the original request we discuss here is for the HW watchdog (/dev/watchdog) and enabling it in userspace, which we do not support at all currently. I don’t really get what kernel or core rollbacks would achieve here when you are after simply recovering a saturated system.

ppisati · June 7, 2018, 11:22am

As far as watchdog support for the bcm2835 in u-boot goes, i wrote it and pushed it upstream ~1 year ago:

http://git.denx.de/?p=u-boot.git;a=commit;h=45a6d231b2f9b891a7df517fc40b8466e12f2b57

so if you want to play with it, it’s there.

vpetersson · June 11, 2018, 10:31am

Yep, i’m primarily interested in the HW watchdog. Given that the Raspberry Pi is a reference platform, and this being a fairly core feature in any serious deployment, can we schedule the work to enable this?

ogra · June 11, 2018, 12:25pm

Well, waiting for feedback from the core team … @niemeyer, @mvo ?

(also, more details are at http://0pointer.de/blog/projects/watchdog.html)

@vpetersson: Note that as a quick workaround (until there is a way to make snapd handle it) you should be able to dump a config file into /etc/systemd/system.conf.d/ with the RuntimeWatchdogSec= option set like:

[Manager]
RuntimeWatchdogSec=10
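A minimal sketch of how such a drop-in could be written from a script follows. The function name write_watchdog_dropin and the file name 10-watchdog.conf are assumptions chosen for illustration. Only the [Manager] section and the two option names come from the discussion above.

```shell
# Sketch: write a systemd manager drop-in that sets the watchdog options.
# write_watchdog_dropin and 10-watchdog.conf are hypothetical names.
write_watchdog_dropin() {
    conf_dir="$1"            # e.g. /etc/systemd/system.conf.d
    runtime_sec="${2:-10}"   # value for RuntimeWatchdogSec=
    shutdown_sec="${3:-}"    # optional value for ShutdownWatchdogSec=

    mkdir -p "$conf_dir" || return 1
    {
        printf '[Manager]\n'
        printf 'RuntimeWatchdogSec=%s\n' "$runtime_sec"
        # Only emit the shutdown option when a value was given.
        if [ -n "$shutdown_sec" ]; then
            printf 'ShutdownWatchdogSec=%s\n' "$shutdown_sec"
        fi
    } > "$conf_dir/10-watchdog.conf"
}
```

On a device this would be called as root with /etc/systemd/system.conf.d as the directory, followed by a reboot so that systemd reads the new setting. Keeping the directory as a parameter makes it easy to try the function against a scratch directory first.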

vpetersson · June 11, 2018, 2:47pm

@ogra - Isn’t that a read-only filesystem tho? If we modify it in the image creation process, we have no way to revert it.

ogra · June 11, 2018, 2:50pm

the directory is writable, but indeed, adding a file there is a hack, this was just in case it is urgent for you to get such a thing enabled while a proper implementation is on the way …

vpetersson · June 11, 2018, 3:02pm

No, it’s not that urgent. I’d rather wait for the proper implementation.

On that note, where is the timeline/roadmap set for Snapd? I noticed that it is not on Github.

ogra · June 11, 2018, 5:54pm

heh … and obviously there is even a related launchpad bug … filed in march 2017:

vpetersson · June 11, 2018, 6:16pm

Can we squeeze that into the next sprint/release then?

abeato · June 12, 2018, 2:55pm

Related PR: https://github.com/snapcore/snapd/pull/5309