I was looking into the use of the watchdog device today on the Raspberry Pi version. What I found is that the kernel module does appear to be loaded:
$ lsmod | grep wdt
bcm2835_wdt 16384 0
While I couldn’t find a daemon or configuration in systemd, I was able to get this:
$ sudo wdctl
wdctl: write failed: Invalid argument
Device: /dev/watchdog
Identity: Broadcom BCM2835 Watchdog timer [version 0]
Timeout: 15 seconds
Timeleft: 14 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          0           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
I’m not sure how this is configured, and I’d like to make sure that it is properly configured such that if the system freezes, it will get rebooted properly.
systemd has a builtin watchdog that is sadly not configurable currently. to enable it we’d have to make /etc/systemd/system.conf writable and allow the setting of:
RuntimeWatchdogSec=
… which defines the number of seconds after which the hardware watchdog auto-reboots the system if systemd has stopped pinging it …
ShutdownWatchdogSec=
… which makes the hardware force a system reset in case a reboot hangs.
defining these watchdog values should probably become a core/system config option, @mvo, @niemeyer … comments ?
i think the default should be the current default (off), else you will have the images explode on hardware with a buggy watchdog implementation etc.
it should be easily configurable via gadget.yaml (which turning it into a core/system option allows) and the reference platforms should have it set in there by default.
I would love to add that the eventual watchdog support should be comprehensive enough to be enabled in the boot loader so that snapd’s automatic revert to a working kernel/core snap kicks in on a failed boot without user interaction (power cycle) of the board.
Note that this is not about failed boots but about e.g. single core CPUs with constantly saturated I/O, hardware issues from peripherals, hangs through additional software that misbehaves etc …
i wouldn’t tie the hardware watchdog to forced rollbacks here … we do have bootloader watchdog support enabled by default in our reference images, we also have the software watchdog enabled for services in systemd AFAIK and we have the automated roll-back for failed system boots. the hardware watchdog is not necessarily tied to kernel or rootfs here and i wouldn’t want to add this tie artificially (it could indeed be an optional feature, but if you have a misbehaving service snap, would you really want your board to constantly roll back kernels because I/O is locked down by heavy DB operations or some such ?)
AFAIK we don’t have a watchdog enabled in the boot loader on any platform, is this fixed now?
a boot loader that hangs loading a broken kernel will not recover without a watchdog or human interaction
Once Linux has booted and systemd manages the watchdog nothing should fail even under high IO (unless systemd swaps out and the board literally comes down to a halt, at which point a reboot is probably expected but would not cause a rollback anymore).
I didn’t say fail … the hardware watchdog checks for response, not for failures. a single core CPU like you find in many routers with low RAM and no swap can easily go into stalled IO simply because you can not properly multithread if your CPU does not allow this … a HW watchdog will check if the system still responds and if not it will force-reboot. If you install a postgres DB backed snap service on a beaglebone and the amount of data gets too big for the hardware to stay responsive, the watchdog will take care of it for you. this has nothing to do with kernel or rootfs, it is a limitation of the hardware you use and no roll-back will fix the unresponsiveness.
a remote system that you can not reach to reset manually needs to be able to auto-reset itself, so that you can still reach it to debug, fix and analyze it over the network. even if there was no kernel oops, OOM or whatnot, you want this system to not stay unresponsive forever due to saturation.
There is no db in a bootloader and i was not referring to this at all …
Also, i think @ppisati has put some time into enabling watchdog features in u-boot (though i might admittedly mis-remember this) and as long as the config option is enabled in u-boot it will do exactly this … reboot the device after a kernel hang.
But in any case the original request we discuss here is for the HW watchdog (/dev/watchdog) and enabling it in userspace, which we do not support at all currently. I don’t really get what kernel or core rollbacks would achieve here when you are simply after recovering a saturated system.
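For context on what "enabling it in userspace" means: once a process opens /dev/watchdog, the timer is armed and the process has to keep writing to the device before the timeout runs out, otherwise the hardware resets the board. A minimal sketch in Python, not how systemd or snapd do it; the function name `pet_watchdog` and its parameters are made up for illustration, and it assumes the driver honours the magic close character (the MAGICCLOSE flag in the wdctl output above):

```python
import time


def pet_watchdog(device, interval, rounds):
    """Ping a watchdog device `rounds` times, then disarm it cleanly.

    `interval` (seconds) must stay below the hardware timeout,
    which wdctl reports as 15 seconds on the Pi above.
    """
    # Opening the device arms the timer; unbuffered so each write
    # reaches the driver immediately.
    with open(device, "wb", buffering=0) as wd:
        for _ in range(rounds):
            wd.write(b"\0")  # any write counts as a keepalive ping
            time.sleep(interval)
        # Magic close: writing 'V' right before close asks the driver to
        # stop the timer. If we crash before this line, the device is
        # closed without it and the watchdog is expected to fire.
        wd.write(b"V")


# Hypothetical usage (needs root and really arms the watchdog):
# pet_watchdog("/dev/watchdog", interval=5.0, rounds=12)
```

If the loop hangs or the process gets starved by saturated I/O, the pings stop and the board reboots, which is exactly the recovery behaviour discussed here.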
Yep, i’m primarily interested in the HW watchdog. Given that the Raspberry Pi is a reference platform, and this being a fairly core feature in any serious deployment, can we schedule the work to enable this?
@vpetersson: Note that as a quick workaround (until there is a way to make snapd handle it) you should be able to dump a config file into /etc/systemd/system.conf.d/ with the RuntimeWatchdogSec= option set like:
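A minimal sketch of such a drop-in, assuming a hypothetical file name like /etc/systemd/system.conf.d/10-watchdog.conf and illustrative values (the runtime value is chosen to stay below the 15 second timeout that wdctl reports above):

```ini
# /etc/systemd/system.conf.d/10-watchdog.conf (file name is an assumption)
[Manager]
# systemd pings /dev/watchdog; if it stops, the hardware resets the board
RuntimeWatchdogSec=10
# optional: also force a reset if a reboot or shutdown hangs
ShutdownWatchdogSec=2min
```

The setting should be picked up on the next boot; running `sudo wdctl` afterwards is one way to check that the timeout changed.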
the directory is writable, but indeed, adding a file there is a hack. this was just in case it is urgent for you to get such a thing enabled while a proper implementation is on the way …