Firstboot seeding failure scenario + possible fixes, and boot process confusion/question

pstolowski · August 8, 2019, 3:12pm

Note: this is mostly interesting for snapd & ubuntu core devs.

I’ve been investigating the known cases where first boot fails while seeding and, once the device is (manually) rebooted, it fails to boot and remains unbootable and unusable. This is easy to reproduce by creating an Ubuntu Core image with a broken gadget snap (e.g. one with a malformed gadget.yaml), although I think it would probably happen with any other broken snap that is part of the seeds. While there is not much to do if the seeds are broken, we should definitely be resilient and not leave the device in an unbootable state, so that the developer of the system can investigate the issue.

The problem is caused by the “undo” handling code in snapd: when the seeding change fails due to a broken snap, snapd starts undoing all the installs of seeded snaps. This includes the kernel and core snaps, which are provided by the seeds but also have symlinks in ‘/var/lib/snapd/snaps’ pointing at ‘/var/lib/snapd/seed/snaps’ on the (initial) image. The undo code removes .snap files from ‘/var/lib/snapd/snaps’, so it removes the symlink that initrd requires to boot the system. I think the relevant code is in the undoMountSnap handler - backend.UndoSetupSnap.
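
The failure mode can be reproduced in miniature outside snapd. The Python sketch below is only a toy model of the layout described above, not snapd code: the directory names mirror the post, while `undo_setup_snap` is a hypothetical stand-in for the undo handler and the snap file name is made up.

```python
import os
import tempfile

def make_image(root, snap_file):
    """Lay out a toy image: real file in seed/snaps, symlink in snaps/."""
    seed_dir = os.path.join(root, "var/lib/snapd/seed/snaps")
    snaps_dir = os.path.join(root, "var/lib/snapd/snaps")
    os.makedirs(seed_dir)
    os.makedirs(snaps_dir)
    seed_path = os.path.join(seed_dir, snap_file)
    with open(seed_path, "w") as f:
        f.write("fake squashfs")
    link_path = os.path.join(snaps_dir, snap_file)
    os.symlink(seed_path, link_path)
    return seed_path, link_path

def undo_setup_snap(link_path):
    """Hypothetical stand-in for the undo step: drop the .snap from snaps/."""
    os.remove(link_path)  # removes the symlink itself, not its target

with tempfile.TemporaryDirectory() as root:
    seed_path, link_path = make_image(root, "pc-kernel_1.snap")
    undo_setup_snap(link_path)
    seed_copy_survives = os.path.exists(seed_path)
    boot_path_survives = os.path.lexists(link_path)

print("seed copy still present:", seed_copy_survives)
print("path in snaps/ still present:", boot_path_survives)
```

The seed copy is untouched, but the path under snaps/ that initrd looks for is gone, which matches the unbootable state described above.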

Snapd fix

I can think of two ways of fixing this in snapd at the moment, but neither of them is super elegant due to the fact that having kernel/core symlinks in snaps/ is not ideal in the first place[1], and there are possibly open questions; possible fixes are:

Special-case undo handlers for seeding, e.g. do not remove kernel/core .snap files (symlinks) if seeded flag is not set, keeping the system intact and bootable.
Install seeded snaps in separate lanes (like we do e.g. when installing multiple snaps at once with normal snap install ops) and as a consequence do not undo those that were successful (in particular leave kernel/core snaps intact if seeded successfully) in case seeding as a whole fails.
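
As a rough illustration of the first option, the decision could look like the Python sketch below. It is not snapd code (snapd is written in Go): `keep_snap_file_on_undo`, `undo_seeding` and the type names in `BOOT_CRITICAL_TYPES` are hypothetical and only model the rule "do not remove kernel/core .snap files while the seeded flag is not set".

```python
# Hypothetical type names; the post only speaks of "kernel/core" snaps.
BOOT_CRITICAL_TYPES = {"kernel", "core"}

def keep_snap_file_on_undo(snap_type, seeded):
    """Return True when undo should leave the .snap file (symlink) alone."""
    return (not seeded) and snap_type in BOOT_CRITICAL_TYPES

def undo_seeding(installed, seeded):
    """Split installed snaps into (removed, kept) for a failed change.

    `installed` maps snap name -> snap type.
    """
    removed, kept = [], []
    for name, snap_type in installed.items():
        if keep_snap_file_on_undo(snap_type, seeded):
            kept.append(name)
        else:
            removed.append(name)
    return removed, kept

installed = {"pc-kernel": "kernel", "core": "core", "pc": "gadget"}
removed, kept = undo_seeding(installed, seeded=False)
print("removed:", removed, "kept:", kept)
```

Once the system is seeded the special case no longer applies, so a normal refresh or install that fails is undone exactly as before.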

The latter however raises the question of what state the system is in in such a case: it is certainly not “seeded” (and the ‘seeded’ flag should not be set), only partially seeded, so the system may appear usable although it most probably is not (some seeded snaps are not installed).
It’s worth noting that in both cases seeding will be retried every few minutes (although it’s bound to fail the same way, because if there is a broken snap, it remains broken).
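
The second option and the resulting "partially seeded" state can be modelled with the sketch below. It is an assumption-laden toy, not how snapd lanes are implemented: `seed_in_lanes` and `fake_install` are hypothetical, and a "lane" is reduced to "one snap whose failure does not undo the others".

```python
def seed_in_lanes(snaps, install):
    """Install each snap in its own lane.

    `install` is a hypothetical callable that raises ValueError on a
    broken snap. A failure only affects the failing snap's lane, so
    earlier successes are not undone.
    """
    installed, failed = [], []
    for name in snaps:
        try:
            install(name)
            installed.append(name)
        except ValueError:
            failed.append(name)
    seeded = not failed  # only a fully successful run counts as seeded
    return installed, failed, seeded

def fake_install(name):
    if name == "pc":  # pretend the gadget has a malformed gadget.yaml
        raise ValueError(f"cannot install {name}: broken snap")

installed, failed, seeded = seed_in_lanes(["core", "pc-kernel", "pc"], fake_install)
print("installed:", installed, "failed:", failed, "seeded:", seeded)
```

The kernel and core snaps stay in place, yet `seeded` remains false, which is exactly the ambiguous state raised above; a retry with the same broken snap would produce the same result.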

Initrd fix

To mitigate the problem and make sure the system is bootable we can also fix the initramfs scripts (@ondra suggested we do it regardless of a potential snapd fix) and make them more robust by trying to mount core/kernel snaps from ‘/var/lib/snapd/snaps’ first, and falling back to ‘/var/lib/snapd/seed/snaps’. I proposed a fix here: https://github.com/snapcore/core-build/pull/51, however I was unable to verify it, and this brings me to the last part.
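
The fallback order can be sketched as follows. The real initramfs scripts are shell and this is not the content of the linked pull request; it is a Python model of the lookup only, with `find_snap` and the snap file name being hypothetical.

```python
import os
import tempfile

# Preferred location first, seed directory as the fallback.
SEARCH_DIRS = ("var/lib/snapd/snaps", "var/lib/snapd/seed/snaps")

def find_snap(root, snap_file):
    """Return the first existing path for snap_file under root, or None."""
    for rel_dir in SEARCH_DIRS:
        candidate = os.path.join(root, rel_dir, snap_file)
        # os.path.exists follows symlinks, so a dangling link is skipped.
        if os.path.exists(candidate):
            return candidate
    return None

def demo():
    """Build a toy root where only the seed copy exists and look it up."""
    with tempfile.TemporaryDirectory() as root:
        for rel_dir in SEARCH_DIRS:
            os.makedirs(os.path.join(root, rel_dir))
        seed_copy = os.path.join(root, SEARCH_DIRS[1], "pc-kernel_1.snap")
        open(seed_copy, "w").close()
        found = find_snap(root, "pc-kernel_1.snap")
        return os.path.relpath(found, root)

print(demo())
```

With the symlink in snaps/ missing, the lookup resolves to the copy under seed/snaps instead of failing.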

"How do I test my patched initrd?" confusion

When I patch the ubuntu-core-rootfs script of the initramfs (unsquash the pc-kernel snap, unpack its initrd, apply my fix, pack the initrd back, put it back in the kernel snap and squash it), and then create an image with:

ubuntu-image --image-size 3G -c beta -O my-image pc-amd64-16.model --snap pc-kernel_258.snap

I can see my updated initrd script is executed on first boot (I’ve added some debug output via /dev/kmsg to verify). However, it only works because my initrd is included in the kernel snap that’s symlinked to ‘/var/lib/snapd/snaps’ (the happy scenario). If I wanted to test the failing scenario, the kernel snap symlink would need to be removed, so my fix is not effective. There must be some other place in the boot process that knows where to look for the initrd and that possibly needs patching too? Does /boot/initrd-core.img need patching maybe (and if so, how)? Can anyone explain the boot process of Ubuntu Core?

Please share your thoughts re snapd & initrd fixes, and initrd testing,
Thanks

CC @pedronis and @mvo

[1] the symlinks exist to make initrd happy I’ve been told.

ondra · August 8, 2019, 5:29pm

@pstolowski thanks for the great work! Just a short comment on your concern about /boot/initrd-core.img: it is only used during the kernel snap build, and only when building the kernel snap from source. /boot/initrd-core.img is not used at runtime, so your test is totally valid and tests the right thing.

But I just realised that when running with the grub bootloader, initrd.img is fetched from the kernel snap, which means we have one more place to fix. See the grub bootloader config: https://github.com/snapcore/pc-amd64-gadget/blob/16/grub.cfg#L39

pstolowski · August 12, 2019, 10:51am

@ondra thanks for looking into this and for pointing out grub.cfg!

It seems to me though that this kind of fallback logic (path checks) is not possible in grub.cfg, as the loopback device is not mounted yet at this point, is it?

cratliff · August 12, 2019, 1:40pm

I think the retry of seeded snaps is a move in the right direction, but other than the kernel and gadget snaps I wonder why it’s necessary to treat seeded snaps differently than snaps installed otherwise.

A couple experiences I’ve had seeding snaps.

When I’ve used the network manager snap as a seeded snap it caused a bricked image because of the undo (I think it was an interface issue that caused the undo), but if I installed the same snap after the first boot the snap installed without issue. So in this case the retry might be very helpful.

After the network manager issue I saw seeding any snap that is not absolutely necessary as a liability. In the couple of years I’ve been using snaps, a couple of snaps I’ve used have been unlisted from the store and it’s been recommended not to use them. If I had made either of those seeded snaps I would be unable to remove them. To me this reinforced that I should not be seeding snaps unless it is required. But not treating seeded snaps differently would change this.

pedronis · August 16, 2019, 2:30pm

something like this encapsulated properly seems the correct thing to me. Although working around in initramfs is tempting, the point of undo is to put the system in the same state as it was before the change, in this case ATM we are failing at that.

pstolowski · August 19, 2019, 9:22am

something like this encapsulated properly seems the correct thing to me. Although working around in initramfs is tempting, the point of undo is to put the system in the same state as it was before the change, in this case ATM we are failing at that.

Thanks, I think that is a sensible suggestion, looking at it.

renat2017 · October 14, 2019, 9:44am

We’re experiencing a similar problem with new Core 18 enabled images.

Here is the information about the images:

Images are built from Ubuntu’s Pi3 official model assertions for Core 18 with an exception - we add our own screenly-pro snap to the image with required-snaps: ['screenly-client'] option.
When we build the image with Ubuntu’s Pi3 official model assertion - it works fine.

So it looks like something related to our snap is blocking the initial seeding. That means that I can’t get into the device because it doesn’t respond to the keyboard and doesn’t offer a login prompt, just a blinking cursor. I decided to power-cycle the device after 15 minutes of waiting, logged in to the device and got this from the logs:

Oct 14 09:03:57 localhost systemd[1]: Started Start the snapd services from the snapd snap.
Oct 14 09:03:57 localhost systemd[1]: Starting Wait until snapd is fully seeded (core18)...
Oct 14 09:03:57 localhost systemd[607]: snapd.seeded.service: Failed to execute command: No such file or directory
Oct 14 09:03:57 localhost systemd[607]: snapd.seeded.service: Failed at step EXEC spawning /usr/bin/snap: No such file or directory
Oct 14 09:03:57 localhost systemd[1]: snapd.seeded.service: Main process exited, code=exited, status=203/EXEC
Oct 14 09:03:57 localhost systemd[1]: snapd.seeded.service: Failed with result 'exit-code'.
Oct 14 09:03:57 localhost systemd[1]: Failed to start Wait until snapd is fully seeded (core18).

Guys, does anybody know how to make our Core18 images work?

Thanks!

ondra · October 14, 2019, 1:24pm

@renat2017
you need to get more logs from the first boot, as this looks like the system failed to seed, and we need to start there. Any logs after that, like the one you posted, are just a consequence rather than the cause.

some questions
is screenly-client using base core or core18, or is it not defined? If not defined, does your image also have the core snap added? Otherwise seeding will fail with missing prerequisites.

To narrow it down: when you say the images are built from the official model assertion with an added required-snaps: section, how did you do that? Did you take the pi3 core 18 model assertion and add the line
required-snaps: ['screenly-client']?
and then used that file to build image?

renat2017 · October 14, 2019, 1:48pm

@ondra, thanks for your suggestions. Some answers here:

you need to get more logs from the first boot, as this looks like the system failed to seed, and we need to start there. Any logs after that, like the one you posted, are just a consequence rather than the cause.

We will try to get the logs. The problem is that the system is unresponsive during the first boot and I couldn’t get into it to check the logs.

Is screenly-client using base core or core18, or is it not defined? If not defined, does your image also have the core snap added? Otherwise seeding will fail with missing prerequisites.

Screenly-client’s base is not defined. Thanks for the suggestion, I will try to add a core snap to required-snaps and check what happens.

How did you do that?

Here is a part of a signed model assertion. We built a snap with it successfully.

type: model
authority-id: <screenly's authority id>
series: 16
brand-id: <screenly's brand id>
model: pi3-18
architecture: armhf
base: core18
gadget: pi=18-pi3
kernel: pi-kernel=18-pi3
required-snaps:
  - screenly-client
store: screenly
 ...

Thanks for help.

ondra · October 14, 2019, 2:07pm

@renat2017
lack of core would definitely cause problems when seeding at first boot

As for creating assertion:
to clarify.
did you simply edit existing assertion by adding new line
or
did you create new json file describing assertion, edit it and then sign it with your key to create new signed assertion?

Debugging a broken first boot is a bit tricky: you need to use cloud-init user data to create an account at first boot, so you can log into the device

renat2017 · October 14, 2019, 2:23pm

As for creating assertion:

We wrote a new JSON and signed the assertion file.

Debugging a broken first boot is a bit tricky: you need to use cloud-init user data to create an account at first boot, so you can log into the device

I will try shortly.

By the way, your suggestion about the core snap is working and now it shows a login prompt at least. We are going to try it with cloud-init.

ondra · October 14, 2019, 2:46pm

yep, this was very likely caused by a lack of dependencies (the missing core snap) for the snap, so problem solved

ogra · October 14, 2019, 3:01pm

on a sidenote, you should consider moving/porting your screenly-client snap (and other pre-seeded snaps, if there are any) to core18 eventually, else you waste disk-space for (3) additional core snap installs …

renat2017 · October 14, 2019, 3:12pm

Thanks guys. It worked.

Summary:

When migrating to core18, don’t forget to add core to the model’s required-snaps if any seeded snap still uses the core base (or does not define a base).
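
A quick way to catch this before building an image is to check the model's snap list against the bases the seeded snaps need. The sketch below is a hypothetical helper, not part of snapd or ubuntu-image: `missing_bases` and its inputs are made up for illustration, and it relies on the rule discussed in this thread that a snap without a defined base needs the core snap.

```python
def missing_bases(model_base, required_snaps, snap_bases):
    """Return the base snaps that the seeded snaps need but the model lacks.

    model_base:     the model's `base:` value, e.g. "core18"
    required_snaps: names listed under `required-snaps:` in the model
    snap_bases:     snap name -> declared base, or None when not defined
    """
    available = {model_base} | set(required_snaps)
    missing = set()
    for snap in required_snaps:
        # A snap that does not define a base runs on top of "core".
        base = snap_bases.get(snap) or "core"
        if base not in available:
            missing.add(base)
    return sorted(missing)

bases = {"screenly-client": None}  # base not defined, as in this thread
before = missing_bases("core18", ["screenly-client"], bases)
after = missing_bases("core18", ["screenly-client", "core"], bases)
print("missing before:", before)
print("missing after adding core:", after)
```

With only screenly-client listed, the check reports core as missing; adding core to required-snaps clears it.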

vpetersson · October 14, 2019, 5:54pm

Yeah we’ll definitely do that. We just don’t want to rock the boat too much.