UC 20 custom images issues

sborovkov · January 5, 2021, 1:45pm

Hi.

I’ve built a custom image for UC20. It has 2 extra snaps compared to the default one.

The issue we have is that now our snaps get installed on the second boot. That creates a very bad UX for us. Is that an expected behavior? Can we modify it?

I actually thought the image was bricked when no UI showed up on the second boot. After sshing in I saw that our snap was in the process of being installed. After another few minutes it showed up on the screen. (And if I thought that it might be broken, what would users think?)

And then a few minutes later another snap got installed (network-manager). It brought down the network.
It changed the IP address of the device. So our snap, which actually connects to the internet and shows network info, now had to display completely different connection info.

Am I missing something? Is this configurable, or is it a beta thing?

ijohnson · January 5, 2021, 2:42pm

Hi, sorry this is happening for you… I have a couple of questions/comments

How did you add the additional snaps? Did you use the generic model assertion and use --snap ... arguments to ubuntu-image or did you add your snaps to the model assertion?

This is expected. The very first boot of a new UC20 image is called “install mode”, and it actually installs the system, not the applications. A few things that are done in install mode are:

partition the disk, as not all partitions are created in the image anymore; they are created dynamically at runtime
set up full disk encryption, if it is in use
set up other system data for the next boot (which is called “run mode”; the first boot into run mode does seeding, which installs your applications)

No, install mode is an intrinsic part of UC20 currently.

It is expected that when network-manager is installed it takes over networking, so the previous DHCP lease from systemd-networkd is released and a new one requested. You can prevent this behavior by supplying netplan configuration that specifies network-manager as the only renderer, but this means that you won’t have any IP address available until network-manager is done being seeded/installed on the first boot of run mode.
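A minimal sketch of what such a netplan configuration could look like, written out from Python. The `renderer: NetworkManager` key is standard netplan; the file name `00-network-manager.yaml` and the helper `write_netplan` are hypothetical, and how the file gets into the image (for example through the gadget) is not shown here.

```python
from pathlib import Path
import tempfile

# Minimal netplan document that hands all interfaces to NetworkManager.
NETPLAN_YAML = """\
network:
  version: 2
  renderer: NetworkManager
"""

def write_netplan(directory, filename="00-network-manager.yaml"):
    """Write the netplan snippet into `directory` and return its path.

    The file name is an assumption for illustration only.
    """
    path = Path(directory) / filename
    path.write_text(NETPLAN_YAML)
    return path

# Write into a temporary directory rather than /etc/netplan so the
# sketch is safe to run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    written = write_netplan(tmp)
    print(written.read_text())
```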

Probably what you want is to set things up such that network-manager is seeded before your other snap displaying a GUI, so that nothing confusing is displayed to the user until after networking is fully set up by network-manager.

Hope this helps.

sborovkov · January 5, 2021, 4:57pm

I used snaps field in the model assertion to include extra snaps.

ijohnson · January 5, 2021, 5:18pm

What is the ordering of those snaps? Did you try putting the network-manager snap before the other snap in the snaps header? Snaps are seeded in the order in which they appear in the model assertion (with the caveat that the gadget, base, snapd and kernel snaps are always seeded before all application snaps), so putting network-manager before the other one should ensure that by the time your other application runs, network-manager is done setting up its networking stuff.
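The ordering rule above can be checked with a short Python sketch before the model is signed. The snap name `my-kiosk-app` and the helpers `app_snap_order` and `seeded_before` are hypothetical; the `snaps` list only mirrors the shape of the `snaps` header in a UC20 model, with most required headers left out for brevity.

```python
import json

# Hypothetical excerpt of a UC20 model (JSON form, before signing).
# Only the "snaps" header matters for this check.
MODEL_JSON = """
{
  "type": "model",
  "base": "core20",
  "snaps": [
    {"name": "pc", "type": "gadget"},
    {"name": "pc-kernel", "type": "kernel"},
    {"name": "core20", "type": "base"},
    {"name": "snapd", "type": "snapd"},
    {"name": "network-manager", "type": "app"},
    {"name": "my-kiosk-app", "type": "app"}
  ]
}
"""

def app_snap_order(model):
    """Return application snap names in the order the model lists them."""
    # An entry without an explicit type is treated as an app snap here.
    return [s["name"] for s in model["snaps"] if s.get("type", "app") == "app"]

def seeded_before(model, first, second):
    """True if `first` is listed (and so seeded) ahead of `second`."""
    order = app_snap_order(model)
    return order.index(first) < order.index(second)

model = json.loads(MODEL_JSON)
print(app_snap_order(model))
print(seeded_before(model, "network-manager", "my-kiosk-app"))
```

If the check fails, moving the network-manager entry above the application snap in the `snaps` header is the fix suggested in this thread.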

sborovkov · January 5, 2021, 6:37pm

Got it, thanks. It was the last one.

aljungberg · January 6, 2021, 9:07am

Maybe I’m jumping into something I don’t fully understand, but is it possible to switch from “install mode” to “run mode” without a reboot? Reboots are expensive in terms of wall time.

Or can we do the “install the system” stage ahead of time, when we’re creating our master image that the customer dd’s out to the SD card, so the only thing left to do on first boot is to expand the FS to match the SD card size?

jamesh · January 6, 2021, 11:33am

The full disk encryption support pretty much requires that install mode be run on the target device rather than once to build an image pushed out to multiple devices.

Each device should have its own FDE key, and that key needs to be loaded into its TPM chip.

vpetersson · January 6, 2021, 2:28pm

@jamesh how would this work in a real-life environment?

Imagine this is a packaged end-user IoT device. It is not acceptable from an end-user perspective that it takes the device 5-10 minutes to become usable. This appears to be the case right now if the apps are not bundled and need to be pulled down after the first boot. Having a snap that is 300MB isn’t very uncommon if it requires graphics and dependencies etc.

ijohnson · January 6, 2021, 2:38pm

Not currently, no. As @jamesh pointed out, if you are using FDE, then it is not at all possible to do install mode ahead of time. If you are not concerned about FDE, then there may be things that are technically possible with enough coding time on the snapd side of things (and indeed you are not the first folks to have asked about this), but it’s not currently being actively worked on.

Speaking from my own experience setting up devices, while 5-10 minutes is a long time to wait for a device to become usable, I have had to wait for devices like this before, and I think this experience is much more acceptable if there is a progress bar or some other indicator that the device is doing something and not just spinning its wheels without making any progress. I would say that, from a purely practical standpoint, building out the code and functionality to display a progress bar / splash screen will be easier than getting UC20 to not need any reboots.

ogra · January 6, 2021, 3:23pm

with UC20 the expand feature is completely gone in favour of the “install mode” which in fact freshly partitions all free space on the device instead of expanding an existing partition.

the apps are still bundled in the installer partition (called ubuntu-seed i think) and get copied in place during the install step, the process is pretty much identical to what the former UC releases had (consuming most time for seeding the snaps on first boot of the installed OS etc) except for the partitioning and additional reboot … it should not take much longer apart from the added time the reboot itself takes

jamesh · January 7, 2021, 12:50am

Maybe it would help to have a mode where “install mode” ends with the system powering down rather than rebooting? There’s no user configuration that occurs during install mode, so this could allow the slow part of the deployment to be carried out ahead of time.

ijohnson · January 7, 2021, 4:18am

right that’s what I was referring to with “technically possible with enough coding time on the snapd side of things”

jamesh · January 7, 2021, 9:41am

I was kind of wondering whether you could even do this for the FDE case. While you could make the argument that it is better to generate the key once the owner first powers on the device, they still need to trust that whoever created the device hasn’t done something to capture and exfiltrate the key.

And if you need to trust the device manufacturer anyway, then it doesn’t really matter when the key is generated.

vpetersson · January 7, 2021, 9:51am

In the case of a TPM-backed key, this wouldn’t work, as the key is generated by the TPM. That said, I would imagine that only a small portion of deployments (for Pi at least) would have TPMs.

As much as I love that UC20 adds FDE support (kudos to @ondra et al on this), I’d argue that it’s important not to sacrifice the end-user experience for non-TPM users, who likely constitute the majority of users.

jamesh · January 7, 2021, 10:04am

What I’m suggesting is something like this:

write UC20 image to device in factory.
boot device and let it run through “install mode”, and then power down.
deliver device to customer.

Now the customer has a device that will be in “run mode” on first boot. They’d have to trust that no one has captured the disk encryption key during that boot in the factory, but you’re also trusting the software in the case where the FDE key is generated after the customer takes possession of the device.

vpetersson · January 7, 2021, 10:13am

Gotcha. Yeah that’s an interesting idea. I’m not sure that is viable in most supply chains however, as this would probably add 10+ minutes to each build.

One thing that I’d like to float is the ability to make “assumptions” in the disk image generation. For instance, if you’re operating at scale, you know what the BOM will look like. Thus you can make assumptions such as:

These are the snaps that need to be installed (already supported AFAICT)
The size of the storage is X GB
- This means that the disk image will fail on anything smaller than the specified amount, but it will allow for bypassing the expansion/partitioning entirely and only marginally add to the (compressed) disk image size as it’s just zeros
The devices do/do not have TPMs (relevant for FDE)

ijohnson · January 7, 2021, 3:30pm

Sorry, I think I had misunderstood or confused this post with another one wanting to boot a literally different device, or otherwise clone an image that was booted on one device to multiple devices. What @jamesh has specified here is exactly what we anticipate most customers will do: the install mode boot happens in the factory, and the run mode boot happens in the “field”. We will have hooks and other things for what we are calling “factory mode”, which will essentially be install mode with slight customizations to the behavior, for example having the device shut down rather than reboot at the end of install mode, or running system-level tests for QA before the device leaves the factory.

This is already mostly supported by the gadget.yaml: you can specify the exact size of the disk that you want the Ubuntu Core system to be installed on, and the exact size of the partitions. It already works today to pre-create those partitions on a disk image out-of-band from ubuntu-image and then flash the device. In this case snapd will still wipe the partitions and re-create them, since we need the partitions to be empty before we start installing, in case the previous install was interrupted by a power outage or similar. We may be able to add additional optimizations here which speed things up, but I’m not convinced that creating or formatting partitions is where we spend the most time in install mode. Before we work on optimizing how long it takes for install mode to finish, we need to analyze it more to know what is taking the longest.
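As a rough illustration of the "known storage size" assumption, the Python sketch below sums the partition sizes of a gadget-style layout and checks them against a target disk. The partition sizes, the 8 GB card and the helper names are hypothetical and not taken from any real gadget.yaml; only the role names are the ones UC20 uses.

```python
# Hypothetical partition layout, loosely modelled on the "structure" list of
# a UC20 gadget.yaml. All sizes are illustrative.
STRUCTURE = [
    {"name": "ubuntu-seed", "role": "system-seed", "size": "1200M"},
    {"name": "ubuntu-boot", "role": "system-boot", "size": "750M"},
    {"name": "ubuntu-save", "role": "system-save", "size": "16M"},
    {"name": "ubuntu-data", "role": "system-data", "size": "4G"},
]

# Binary multipliers for the size suffixes used above.
UNITS = {"M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(text):
    """Convert a size such as '750M' or '4G' to bytes."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # no suffix: plain byte count

def total_size(structure):
    """Sum of all partition sizes in bytes."""
    return sum(parse_size(part["size"]) for part in structure)

def fits(structure, disk_bytes):
    """True if the declared partitions fit on a disk of `disk_bytes`."""
    return total_size(structure) <= disk_bytes

disk = 8 * 1000 ** 3  # an "8 GB" card as marketed (decimal gigabytes)
print(total_size(STRUCTURE), fits(STRUCTURE, disk))
```

A check like this only catches a layout that cannot fit; it says nothing about how long install mode takes, which is the open question in the post above.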

This can already be specified in the model assertion with the storage-safety: prefer-unencrypted setting.
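A small Python sketch of where that header sits in the JSON that is later signed into a model assertion. The `storage-safety` header and its three values are real; every other value below (`my-device`, `arm64`, the `dangerous` grade) is a placeholder, several headers a real model needs are omitted, and `with_storage_safety` is a hypothetical helper.

```python
import json

# Values accepted for the "storage-safety" model header.
STORAGE_SAFETY_VALUES = {"encrypted", "prefer-encrypted", "prefer-unencrypted"}

def with_storage_safety(model, value):
    """Return a copy of `model` with the storage-safety header set."""
    if value not in STORAGE_SAFETY_VALUES:
        raise ValueError("unsupported storage-safety value: %r" % value)
    updated = dict(model)  # leave the caller's dict untouched
    updated["storage-safety"] = value
    return updated

# Placeholder model headers; a real model also needs authority-id,
# brand-id, timestamp, snaps and so on.
model = {
    "type": "model",
    "series": "16",
    "model": "my-device",
    "architecture": "arm64",
    "base": "core20",
    "grade": "dangerous",
}

unencrypted = with_storage_safety(model, "prefer-unencrypted")
print(json.dumps(unencrypted, indent=2))
```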