Note: this is mostly interesting for snapd & ubuntu core devs.
I’ve been investigating the known cases where first boot fails while seeding and, once the device is (manually) rebooted, it fails to boot and remains unbootable and unusable. This is easy to reproduce by creating an ubuntu core image with a broken gadget snap (e.g. one with a malformed gadget.yaml), although it would probably happen (I think) with any other broken snap that is part of the seeds. While there is not much to do if the seeds are broken, we should definitely be resilient and not leave the device in an unbootable state, so that the developer of the system can investigate the issue.
The problem is caused by the “undo” handling code in snapd: when the seeding change fails due to a broken snap, snapd starts undoing the installs of all seeded snaps. This includes the kernel and core snaps, which are provided by the seeds but on the (initial) image are present in ‘/var/lib/snapd/snaps’ only as symlinks into ‘/var/lib/snapd/seed/snaps’. The undo code removes .snap files from ‘/var/lib/snapd/snaps’, so it removes the symlinks that the initrd requires to boot the system. I think the relevant code is in the undoMountSnap handler (backend.UndoSetupSnap).
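The effect can be reproduced in a scratch directory without touching a real system. This is only a toy model: the file name core_1.snap and the temporary directory layout are made up for illustration, and a plain rm stands in for whatever the undo handler does to the entry in snaps/.

```shell
#!/bin/sh
# Toy model of the image layout. Everything lives under a temp dir,
# not under the real /var/lib/snapd.
root=$(mktemp -d)
mkdir -p "$root/seed/snaps" "$root/snaps"

# The seed holds the real file; snaps/ only holds a symlink to it.
echo "fake squashfs" > "$root/seed/snaps/core_1.snap"   # hypothetical file name
ln -s "$root/seed/snaps/core_1.snap" "$root/snaps/core_1.snap"

# Stand-in for the undo step: remove the .snap entry from snaps/.
rm "$root/snaps/core_1.snap"

# The seed copy survives, but the path under snaps/ is gone.
[ -e "$root/seed/snaps/core_1.snap" ] && echo "seed copy: present"
[ -e "$root/snaps/core_1.snap" ] || echo "snaps/ entry: missing"
```

Removing the symlink leaves the seed copy untouched, which is why a lookup that only checks snaps/ fails even though the snap itself is still on disk.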
Snapd fix
I can think of two ways of fixing this in snapd at the moment, but neither of them is super elegant, because having kernel/core symlinks in snaps/ is not ideal in the first place[1], and there are possibly open questions. Possible fixes are:
- Special-case the undo handlers for seeding, e.g. do not remove kernel/core .snap files (symlinks) if the seeded flag is not set, keeping the system intact and bootable.
- Install seeded snaps in separate lanes (like we do e.g. when installing multiple snaps at once with normal snap install ops) and, as a consequence, do not undo those that were successful (in particular, leave the kernel/core snaps intact if they were seeded successfully) in case seeding as a whole fails.
The latter, however, raises the question of what state the system is in in such a case: it is certainly not “seeded” (and the ‘seeded’ flag should not be set), only partially seeded, so the system may appear usable although it most probably is not (some seeded snaps are not installed).
It’s worth noting that in both cases seeding will be retried every few minutes (although it’s bound to fail the same way, because if there is a broken snap, it remains broken).
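The first option (special-casing undo while the system is not yet seeded) can be modelled roughly as below. This is a toy shell model and not snapd code: snapd is written in Go, and the function name undo_snap_file, its arguments and the file names are all hypothetical.

```shell
#!/bin/sh
# Toy model only; the real undo handlers live in snapd's Go code.
# undo_snap_file FILE TYPE SEEDED
#   Removes FILE, unless it belongs to a kernel/core snap and the
#   system has not finished seeding (SEEDED is not "true").
undo_snap_file() {
    file=$1
    type=$2
    seeded=$3
    case "$type" in
        kernel|core)
            if [ "$seeded" != "true" ]; then
                echo "keeping $file (needed to boot, system not seeded)"
                return 0
            fi
            ;;
    esac
    rm -f "$file"
}

# Demo in a scratch directory with made-up file names.
dir=$(mktemp -d)
touch "$dir/pc-kernel_1.snap" "$dir/hello_1.snap"
undo_snap_file "$dir/pc-kernel_1.snap" kernel false   # kept
undo_snap_file "$dir/hello_1.snap" app false          # removed
```

The second option (separate lanes) changes how the seeding change is built rather than how a single undo step behaves, so it is not covered by this model.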
Initrd fix
To mitigate the problem and make sure the system is bootable, we can also fix the initramfs scripts (@ondra suggested we do it regardless of a potential snapd fix) and make them more robust by trying to mount the core/kernel snaps from ‘/var/lib/snapd/snaps’ first and falling back to ‘/var/lib/snapd/seed/snaps’. I proposed a fix here: https://github.com/snapcore/core-build/pull/51, however I was unable to verify it, which brings me to the last part.
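The fallback idea, as a minimal sketch. This is not the code from the pull request: the function name find_snap and its arguments are assumptions made for illustration, and the root prefix is a parameter so the lookup can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch of the fallback lookup; not the actual initramfs script.
# find_snap ROOT SNAP_FILE
#   Prints the first existing path for SNAP_FILE, preferring snaps/
#   over seed/snaps/. Returns 1 if the file is in neither place.
find_snap() {
    root=$1
    snap=$2
    for dir in "$root/var/lib/snapd/snaps" "$root/var/lib/snapd/seed/snaps"; do
        # -e follows symlinks, so a dangling symlink counts as missing.
        if [ -e "$dir/$snap" ]; then
            echo "$dir/$snap"
            return 0
        fi
    done
    return 1
}
```

With this shape, a removed or dangling entry in snaps/ falls through to the seed copy, while a normally installed snap is still picked up from snaps/ first.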
"How do I test my patched initrd?" confusion
When I patch the ubuntu-core-rootfs script of the initramfs (unsquash the pc-kernel snap, unpack its initrd, apply my fix, pack the initrd back, put it back in the kernel snap and squash it) and then create an image with:
ubuntu-image --image-size 3G -c beta -O my-image pc-amd64-16.model --snap pc-kernel_258.snap
I can see that my updated initrd script is executed on first boot (I’ve added some debug output via /dev/kmsg to verify). However, it only works because my initrd is included in the kernel snap that’s symlinked into ‘/var/lib/snapd/snaps’ (the happy scenario). If I wanted to test the failing scenario, the kernel snap symlink would need to be removed, and then my fix is not effective, so there must be some other place in the boot process that knows where to look for the initrd and that possibly needs patching too. Does /boot/initrd-core.img need patching maybe (if so, how)? Can anyone explain the boot process of ubuntu core?
Please share your thoughts re snapd & initrd fixes, and initrd testing,
Thanks
[1] I’ve been told the symlinks exist to make the initrd happy.