Enabling new snapd features on next boot

This is a technical post about a snapd implementation detail.

Problem statement

When designing some of the new cgroup-based tracking features I realized that this kind of feature is hard to enable reliably. On a machine that updates, either through the classic package or through the re-execution system, a new snapd that knows about the feature and has it switched to enabled-when-unset will start using it immediately. So far that was mostly okay because features did not retroactively affect existing behavior (e.g. parallel installs or layouts). With the new refresh app awareness feature this is not the case.

The feature only works reliably if it is enabled before anything that relies on it is used. Attempting to handle the other case would introduce a large amount of complexity. Worse, that complexity would only be triggered in the edge case of a machine that has upgraded but has not rebooted yet.

I think I found an elegant and efficient way to solve this problem in general. Here’s how it would work:

/proc/sys/kernel/random/boot_id

This file contains a UUID of the current boot. Reading it we can reliably determine which boot we are on. We already use this file in the BootID function.

Feature files store condition

Currently feature flags that are exported outside of snapd state are represented as empty files in the directory /var/lib/snapd/features. Old versions of snapd understand presence of a file as an indicator that a feature is enabled.

We can extend this idea to store some data in the file that indicates more precisely when a feature is enabled. The following semantics are proposed:

  • When a feature file is absent the feature is disabled
  • When a feature file is present and empty the feature is enabled
  • When a feature file is present and non-empty the content has the scanf format
    %c %37s. The file encodes a single flag and a boot-id. Two flags are defined for now: ! and =. The ! flag is used when a feature is being enabled but must only become enabled on the next boot. It can be understood as “not this boot-id”. The = flag is used when a feature is being disabled but must only become disabled on the next boot. It can be understood as “just this boot-id”.

The advantage of this approach is that the files encode the correct behavior even if snapd doesn’t start and never gets a chance to rewrite them.

snap debug features

To help understand what is enabled and what isn’t, I propose adding a new command. It would print a table like this:

$ snap debug features
Name                            Exported  State      Condition
hotplug                         no        enabled    -
layouts                         yes       enabled    -
refresh-app-awareness           yes       enabled    boot-id != 9c09e005-f08d-482d-a0a5-44cc3ac6bf2f
parallel-installs               yes       enabled    boot-id == 9c09e005-f08d-482d-a0a5-44cc3ac6bf2f
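The Condition column in this table can be derived directly from the stored flag and boot-id. The Go sketch below shows one way to do that; describeCondition and the column widths are assumptions for illustration, not an existing snap command implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// describeCondition turns a stored condition such as "! <boot-id>" into the
// human-readable form used in the table. An empty or unrecognized condition
// is rendered as "-". Hypothetical helper for illustration.
func describeCondition(raw string) string {
	fields := strings.Fields(raw)
	if len(fields) != 2 {
		return "-"
	}
	switch fields[0] {
	case "!":
		return "boot-id != " + fields[1]
	case "=":
		return "boot-id == " + fields[1]
	}
	return "-"
}

func main() {
	const rowFormat = "%-32s%-10s%-11s%s\n"
	bootID := "9c09e005-f08d-482d-a0a5-44cc3ac6bf2f"
	fmt.Printf(rowFormat, "Name", "Exported", "State", "Condition")
	fmt.Printf(rowFormat, "layouts", "yes", "enabled", describeCondition(""))
	fmt.Printf(rowFormat, "refresh-app-awareness", "yes", "enabled",
		describeCondition("! "+bootID))
	fmt.Printf(rowFormat, "parallel-installs", "yes", "enabled",
		describeCondition("= "+bootID))
}
```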

Changes to snapd state.json

Due to the way feature flags are used internally and the way they are exported to disk, we cannot rely on observing edge transitions (enabled > disabled or disabled > enabled). As such we must store the condition in the state.

I would like to propose that the state file store the condition code as config.core.experimental.conditions.$name where $name is the name of the condition. I would propose that the exact same format used for the external files be used internally. This approach is deemed safe against snapd rollback.

Changes to features API

The Go and C features APIs need the following changes.

  • Each flag needs to be either immediate or delayed. Delayed features participate in the condition evaluation. Immediate features work as features worked before. Note that this distinction is only relevant for the Go side, which handles changing the state.
  • The parallel-installs and refresh-app-awareness features should become delayed.
  • The C implementation needs to consider the size of the file and evaluate the boot condition.

Changes to snap set system ...

The overlord will need a small change to store the condition for changes to features that have the delayed property. When enabling, it will set the condition ! $boot_id; when disabling, the condition becomes = $boot_id. Note that enabling a conditionally disabled feature or disabling a conditionally enabled feature can simply change the state immediately and remove the condition.

For the “don’t enable it this boot” condition, why not have /run/snapd/<something> flag files?

edit: also, why are parallel installs delayed? I see no explanation for the shift; what am I missing?

We need to handle both enabling and disabling a feature on next boot. This has to be operational both inside and outside of snapd. While the fact that /run is reset on boot is useful, I think that having to synchronize two files rather than one adds extra complexity. Unless there’s a compelling simplification to be had this way I’d rather not use additional files.

As for parallel instances that’s a good question. It’s a bunch of extra state in my head but the short story is that we may need to change how /snap is mounted via a new systemd unit early on and that would really be only sensible to do on boot. Since I’m not focusing on the parallel instances question now I didn’t describe it in detail. I’d like to discuss with @mborzecki first.

this in itself sounds backward incompatible; so far presence meant on

you’ll have to expand on the details of that. Is this about snapd? about snap-confine not having done tracking?

what happens with this state if I “snap revert” snapd or core? It’s unclear to me why tracking boot-ids would be enough

in general there is a lot of how and little why. If I revert snapd and use the wrong snap-confine for a bit and then snap revert forward again without reboots wouldn’t snap state not track this anyway?

Uh, this is a silly editing mistake. I did mean enabled here.

This is a great question. The primary use-case is reverting from a version that wants to use a feature by default to one that doesn’t want to use it by default (primarily because it is incomplete).

The version that wants to use the feature by default, let’s call it B, must also implement the delayed semantics as described here. The version prior to that, let’s call it A, can also implement the delayed semantics but doesn’t have to.

When snapd starts up, there’s a bit of logic in the overlord that updates the exported features. At the same time we should store condition values in the state for the things that are not explicitly enabled in the state but are enabled when unset. This would also be reflected in the files for exported features. While version B is running, snap debug features would indicate that the new feature is enabled and that it has a condition ! $current_boot_id. We could also expand snap debug features to have a richer vocabulary and call this variant auto-enabled to reflect reality better.

From Go and C’s point of view the feature would be disabled because the condition was not met. None of the new logic guarded by that feature would trigger.

Now we revert. For the worst possible case, let’s revert to a version that has neither of the two. Snapd restarts and, again, writes the feature files. Because the state indicates the feature is not enabled, the feature files are removed. None of the C or Go code in version A will consider the feature to be enabled. So far so good.

What if the system was booted with version B and then reverted to version A?

This is more complex and it depends on how we design conditional features. Let’s cover the basics first and then the more problematic cases of system being already configured in a way that was driven by B.

At a basic level, the system booted with features conditionally enabled. The condition was met and various changes took place. Mid-flight the operator issues a revert operation and snapd goes back from B to A. On startup version A will rewrite the feature files, as indicated above. Again, they will simply disappear from disk. Some of the system has already acted on the new features: processes may inhabit specific cgroups. There may be a global (system-wide) event handler installed. The /snap directory may be configured by a mount unit that changes event propagation.

None of the facilities of snapd in version A are effective. They operate as if nothing was changed. Additional, now partial, process tracking is unused. Additional mount units simply affect the system in a way that snapd version A does not measure. This does require some crafting of the features so that they don’t break past versions of snapd but this is a general property.

Version A will keep the unused part of the state where the core.experimental.conditions tree is non-empty. If a future version B is used again then it will again observe the features going from unset-by-default to set-by-default and write down the condition strings again, taking care of any stale data.

There is one corner case that is worth mentioning. It doesn’t relate to the delayed feature proposal directly but to refresh app awareness. Let’s say version B installs a new named, controller-less cgroup with a release handler, so that snapd can reliably know when a cgroup is unused and can be disposed of. We must implement version B so that it doesn’t fail badly if, after installation of the handler, version B is removed from disk. In practice it seems like we must copy the event handler (which is a program touching a file) to /run/snapd/exec so that it can remain there for the duration of uptime. Note that version A will simply ignore the state files touched by the event handler. Note that an even-more-future version C can replace the handler as well, if some more logic is necessary (though said logic should be behind a separate delayed feature flag).

The fact that this required such a long explanation (which I have only skimmed) shows there is likely not enough natural correlation between what is tracked and the prerequisites we want to ensure. I think we need to look at what the prerequisites actually are, or could be, for the various features and devise mechanisms to check those explicitly on a family-by-family basis.

We also need to consider that depending on the features the best UX might be different:

  • ask the user on an operation that it needs to reboot to use it
  • just proceed without the feature with warning or not

etc

I disagree with the conclusion but I think it will be easier to discuss the merits or faults of this proposal when you have more context. I am writing a description of what refresh-app-awareness and parallel-instances might require from the boot of the system to operate correctly. With that context we can revisit this thread.

yes, if reboots are relevant leveraging “/run” might be fruitful but anything that depends on snapd writing might imply the snap run chain/snap-confine waiting (which it does already sometimes). So in some cases direct checks of the prerequisite might be preferable.