Firstboot seeding failure scenario + possible fixes, and boot process confusion/question

pstolowski

#1

Note: this is mostly interesting for snapd & ubuntu core devs.

I’ve been investigating the known cases where first boot fails while seeding and, once the device is (manually) rebooted, it fails to boot and remains unbootable and unusable. This is easy to reproduce by creating an Ubuntu Core image with a broken gadget snap (e.g. one with a malformed gadget.yaml), although I think it would probably happen with any other broken snap that is part of the seed. While there is not much we can do if the seed snaps are broken, we should definitely be resilient and not leave the device in an unbootable state, so that the developer of the system can investigate the issue.

The problem is caused by the “undo” handling code in snapd: when the seeding change fails due to a broken snap, snapd starts undoing all the installs of seeded snaps. This includes the kernel and core snaps, which are provided by the seed but are also symlinked on the (initial) image: ‘/var/lib/snapd/snaps’ contains symlinks to the files in ‘/var/lib/snapd/seed/snaps’. The undo code removes .snap files from ‘/var/lib/snapd/snaps’, so it removes the symlinks that the initrd requires to boot the system. I think the relevant code is in the undoMountSnap handler - backend.UndoSetupSnap.

Snapd fix

I can think of two ways of fixing this in snapd at the moment, but neither of them is super elegant due to the fact that having kernel/core symlinks in snaps/ is not ideal in the first place[1], and there are possibly open questions; possible fixes are:

  1. Special-case undo handlers for seeding, e.g. do not remove kernel/core .snap files (symlinks) if the seeded flag is not set, keeping the system intact and bootable.
  2. Install seeded snaps in separate lanes (like we do e.g. when installing multiple snaps at once with normal snap install ops) and, as a consequence, do not undo those that were successful (in particular, leave kernel/core snaps intact if seeded successfully) in case seeding as a whole fails.

The latter, however, raises the question of what state the system is in in such a case. It certainly is not “seeded” (and the ‘seeded’ flag should not be set); it is only partially seeded, so the system may appear usable although it most probably is not (some seeded snaps are not installed).
It’s worth noting that in both cases seeding will be retried every few minutes (although it’s bound to fail the same way, because a broken snap remains broken).
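
To make option 1 more concrete, here is a rough sketch of the decision it describes, written as a shell function purely for illustration. snapd itself is written in Go and this is not its code: the function name `undo_snap_file`, its arguments and the way the seeded state is passed in are all made up for the example. It also simplifies by recognising the kernel/core files through their symlink target (assumed to be an absolute path into the seed directory) rather than by snap type.

```shell
#!/bin/sh
# Hypothetical illustration only, not snapd code.
# $1: path of the .snap file in the snaps directory
# $2: absolute path of the seed snaps directory
# $3: "true" if the system is already seeded, anything else otherwise
undo_snap_file() {
    snap_file="$1"
    seed_dir="$2"
    seeded="$3"

    # While the system is not seeded, keep symlinks that point into the
    # seed directory so the initrd can still find kernel/core on reboot.
    if [ "$seeded" != "true" ] && [ -L "$snap_file" ]; then
        target=$(readlink "$snap_file")
        case "$target" in
            "$seed_dir"/*)
                echo "keeping $snap_file (seed symlink, system not seeded)"
                return 0
                ;;
        esac
    fi

    # Regular undo behaviour: remove the file (or symlink) from snaps/.
    rm -f "$snap_file"
}
```

With this logic a failed seeding run would still remove ordinary .snap files, but the symlinks the initrd depends on would survive until the seeded flag is set.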

Initrd fix

To mitigate the problem and make sure the system is bootable, we can also fix the initramfs scripts (@ondra suggested we do it regardless of a potential snapd fix) and make them more robust by trying to mount the core/kernel snaps from ‘/var/lib/snapd/snaps’ first and falling back to ‘/var/lib/snapd/seed/snaps’. I proposed a fix here: https://github.com/snapcore/core-build/pull/51; however, I was unable to verify it, and this brings me to the last part.
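
The fallback idea can be sketched roughly as below. This is not the code from the pull request and not the actual ubuntu-core-rootfs script; `find_snap` and its arguments are hypothetical names, and the root directory is passed in as a parameter because the place where the writable partition is mounted inside the initrd is an assumption here.

```shell
#!/bin/sh
# Hypothetical helper, not the actual initramfs code.
# Prints the first existing path for a snap file, preferring the regular
# snaps directory and falling back to the seed directory.
# $1: root under which var/lib/snapd lives (e.g. a writable mount point)
# $2: snap file name (e.g. pc-kernel_258.snap)
find_snap() {
    root="$1"
    name="$2"
    for dir in "$root/var/lib/snapd/snaps" "$root/var/lib/snapd/seed/snaps"; do
        # -e follows symlinks, so a dangling symlink counts as missing
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

A caller could then do something like `kernel_snap=$(find_snap "$writable_root" pc-kernel_258.snap) || echo "kernel snap not found"` (with `writable_root` being whatever mount point the script uses) before mounting the result.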

"How do I test my patched initrd?" confusion

When I patch the ubuntu-core-rootfs script of the initramfs (unsquash the pc-kernel snap, unpack its initrd, apply my fix, pack the initrd back, put it back in the kernel snap and squash it) and then create an image with:

ubuntu-image --image-size 3G -c beta -O my-image pc-amd64-16.model --snap pc-kernel_258.snap

I can see that my updated initrd script is executed on first boot (I’ve added some debug output via /dev/kmsg to verify). However, it only works because my initrd is included in the kernel snap that’s symlinked in ‘/var/lib/snapd/snaps’ (the happy scenario). If I wanted to test the failing scenario, the kernel snap symlink would need to be removed, in which case my fix is not effective. There must be some other place in the boot process that knows where to look for the initrd and that possibly needs patching too? Does /boot/initrd-core.img need patching, maybe (and if so, how)? Can anyone explain the boot process of Ubuntu Core?

Please share your thoughts re snapd & initrd fixes, and initrd testing,
Thanks

CC @pedronis and @mvo

[1] the symlinks exist to make initrd happy I’ve been told.


#2

@pstolowski thanks for the great work!
Just a short comment about your concern regarding /boot/initrd-core.img.
It is only used during the kernel snap build, and only when building the kernel snap from source.
/boot/initrd-core.img is not used at runtime, so your test is totally valid and tests the right thing.

But I just realised that when running with the grub bootloader, initrd.img is fetched from the kernel snap.
Which means we have one more place to fix :slight_smile:
See grub bootloader config: https://github.com/snapcore/pc-amd64-gadget/blob/16/grub.cfg#L39


#3

@ondra thanks for looking into this and for pointing out grub.cfg!

It seems to me, though, that this kind of fallback logic (path checks) is not possible in grub.cfg, as the loopback device is not mounted yet at this point, is it?


#4

I think the retry of seeded snaps is a move in the right direction, but other than the kernel and gadget snaps I wonder why it’s necessary to treat seeded snaps differently than snaps installed otherwise.

A couple of experiences I’ve had seeding snaps:

When I used the network manager snap as a seeded snap, it resulted in a bricked image because of the undo (I think it was an interface issue that caused the undo), but when I installed the same snap after the first boot, it installed without issue. So in this case the retry might be very helpful.

After the network manager issue I saw seeding any snap that is not absolutely necessary as a liability. In addition, in the couple of years I’ve been using snaps, a couple of snaps I’ve used have been unlisted from the store, with a recommendation not to use them. If I had made either of those a seeded snap, I would be unable to remove it. To me this reinforced that I should not be seeding snaps unless it is required. But not treating seeded snaps differently would change this.


#5

Something like this, encapsulated properly, seems the correct thing to me. Although working around this in the initramfs is tempting, the point of undo is to put the system in the same state as it was before the change, and in this case ATM we are failing at that.


#6

Thanks, I think that is a sensible suggestion, looking at it.