We have identified a problem which occurs with snapd 2.73 (and 2.74) on some Core and Hybrid systems with TPM-backed FDE enabled. When it occurs, the system can end up in an unrecoverable state. A fix has already been identified and merged into secboot, and we are planning to cut a 2.74.1 release to fix the issue.
Root Cause
In snapd 2.71, we switched to a new secboot API, which broke some systems that were running in User Mode rather than Deployed Mode.
In snapd 2.73, we fixed that issue, and added a new check (see 2.73 release update):
On Ubuntu Core systems using UEFI ≥ 2.5, when Deployed Mode is disabled, both AuditMode and DeployedMode are now measured and must be permitted to ensure that autorepair continues to function correctly.
This complies with the TCG PC Client Platform Firmware Profile Specification section 3.3.4.8 PCR[7] – Secure Boot Policy Measurements:
- If the system supports UEFI 2.5 or later and DeployedMode is NOT enabled, the following additional variables MUST be measured into PCR[7]:
  - The contents of the AuditMode variable
  - The contents of the DeployedMode variable
The problem is that not all device firmware implements this part of the spec correctly. In particular, a firmware bug may cause one of the following:
- The UEFI version is ≥ 2.5 and DeployedMode is available and disabled, but the variables are not measured
- The UEFI version is not ≥ 2.5, but the firmware still defines the DeployedMode variable for some reason
The build of secboot which ships in snapd 2.73 assumes that device firmware complies with the spec. If the firmware is not compliant, secboot fails with an error during reseal (when measurements are checked) and, depending on when that reseal occurs, the system can be left in an unrecoverable state.
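To see which state the mode variables are in on a given device, they can be read from efivarfs. The following is a minimal sketch, assuming a Linux system with efivarfs mounted at /sys/firmware/efi/efivars and sufficient privileges to read it; the helper names efi_var_byte and report_mode_var are made up for illustration. It only shows the current values of the variables. It does not inspect the TCG event log, so it cannot tell you whether the firmware actually measured them.

```shell
#!/bin/sh
# GUID of the UEFI global variable namespace (EFI_GLOBAL_VARIABLE).
EFI_GLOBAL_GUID=8be4df61-93ca-11d2-aa0d-00e098032b8c
# Overridable so the helpers can be pointed at a test directory.
EFIVARS_DIR=${EFIVARS_DIR:-/sys/firmware/efi/efivars}

# Print the first data byte of an efivarfs file as a decimal number.
# efivarfs prefixes the payload with 4 attribute bytes, so skip those.
efi_var_byte() {
    od -An -tu1 -j4 -N1 "$1" | tr -d ' \n'
}

# Report one variable as NAME=absent, NAME=0 (disabled) or NAME=1 (enabled).
report_mode_var() {
    f="$EFIVARS_DIR/$1-$EFI_GLOBAL_GUID"
    if [ -r "$f" ]; then
        echo "$1=$(efi_var_byte "$f")"
    else
        echo "$1=absent"
    fi
}

# Usage:
#   report_mode_var DeployedMode
#   report_mode_var AuditMode
```

A device that reports DeployedMode=0 falls into the "present and disabled" situation described above; whether it is actually affected still depends on what the firmware measured.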
Solution
The solution is to relax the checks within secboot so that they no longer assume that DeployedMode and AuditMode are measured whenever DeployedMode is present and disabled. Instead, secboot measures the DeployedMode and AuditMode variables if they appear as disabled in the event log, regardless of their presence in the profile. This fix has been implemented and will ship with snapd 2.74.1.
When/why can the system become unrecoverable?
Generally, snaps support automatic rollback, so if any error occurs during the installation, refresh, or configuration of a snap, that operation is rolled back and the system returns to its prior “good” state. On Ubuntu Core and Hybrid systems, the kernel, boot, and base filesystem are themselves snaps as well, and these “essential” snaps, when refreshed together, are part of the same transaction during the refresh, which we call a “lane”.
However, snapd is not automatically included in the same “lane” as these essential snaps when they are refreshed together. It is also often possible to refresh snapd or any of the “essential” snaps independently.
Refreshing (or reverting) the kernel always triggers a reseal, but refreshing/reverting snapd does not.
Thus, if snapd is not refreshed as part of the same refresh and “lane” as the kernel, then snapd 2.73 may be installed without error. A later kernel refresh will then cause a reseal, which fails due to the secboot problem in 2.73, causing the kernel refresh to fail and revert to the previous version (without reverting snapd). Because changing the kernel always triggers a reseal, another reseal is attempted with the old kernel (and still snapd 2.73), which also fails, leaving the system unbootable.
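The sequence above can be sketched as a toy simulation. This is not snapd code: the function names and the printed outcomes are invented for illustration, the model assumes firmware that trips the bug, and it collapses details such as reboots into a single step.

```shell
#!/bin/sh
# Toy simulation, not snapd code. Assumes firmware that trips the bug,
# so any reseal attempted while snapd 2.73 is active fails.
reseal_ok() {
    [ "$1" != 2.73 ]
}

# $1 = yes if snapd 2.73 was refreshed in the same lane as the kernel,
#      no  if it was installed earlier on its own.
simulate_kernel_refresh() {
    snapd=2.73
    # The kernel refresh triggers a reseal, which fails under snapd 2.73.
    if reseal_ok "$snapd"; then
        echo "refresh ok"
        return 0
    fi
    # The failed lane is rolled back; snapd reverts only if it is in that lane.
    if [ "$1" = yes ]; then
        snapd=2.72
    fi
    # Reverting the kernel triggers a second reseal.
    if reseal_ok "$snapd"; then
        echo "rolled back"
    else
        echo "unbootable"
    fi
}

# Usage:
#   simulate_kernel_refresh yes
#   simulate_kernel_refresh no
```

The only difference between the two runs is whether snapd is part of the lane that gets rolled back, which is what decides whether the second reseal can succeed.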
Mitigations
If you are not using TPM-backed FDE, you are not affected by this bug, and no action is necessary.
If you are using TPM-backed FDE and your system has not yet refreshed to snapd 2.73:
- The safest option is to prevent snapd from refreshing until the fixed 2.74.1 release is available. This can be done via snap refresh --hold snapd, or via validation sets.
- Once the snapd snap is rebuilt with the secboot fix, you can test it from the latest/edge channel.
If your system has already refreshed to snapd 2.73, then there are a few possibilities:
- If you have not yet refreshed the kernel since refreshing snapd, or you’re not certain, then it is safest to revert snapd to 2.72 and then put a hold on it until 2.74.1 is available. This can be done via snap revert snapd and snap refresh --hold snapd.
- If you have refreshed the kernel snap since refreshing snapd to 2.73 (or as part of the same “lane” of the same refresh), your system is likely unaffected, though it’s still safest to revert snapd to 2.72 and hold until 2.74.1 is released.
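The cases above can be turned into a small dry-run helper that prints the suggested commands for a given snapd version instead of running them. This is a sketch for systems with TPM-backed FDE only; the function names are hypothetical, and treating 2.73.x point releases as affected is an assumption, since the text only names 2.73 and 2.74.

```shell
#!/bin/sh
# Strip packaging suffixes such as "+24.04" or "~pre1" from a version string.
base_version() {
    printf '%s\n' "${1%%[+~-]*}"
}

# Affected releases per the text: 2.73 and 2.74 (2.74.1 carries the fix).
# Treating 2.73.x point releases as affected is an assumption.
is_affected() {
    case "$(base_version "$1")" in
        2.73|2.73.*|2.74) return 0 ;;
        *) return 1 ;;
    esac
}

# True if version $1 >= version $2, comparing up to three numeric fields.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" |
        sort -t. -k1,1n -k2,2n -k3,3n | head -n1)
    [ "$lowest" = "$2" ]
}

# Print (do not run) the commands suggested in the text.
plan_mitigation() {
    v=$(base_version "$1")
    if is_affected "$v"; then
        echo "snap revert snapd"
        echo "snap refresh --hold snapd"
    elif version_ge "$v" 2.74.1; then
        echo "# no action needed"
    else
        echo "snap refresh --hold snapd"
    fi
}

# Usage, reading the installed version from `snap version`:
#   plan_mitigation "$(snap version | awk '$1 == "snapd" { print $2 }')"
```

Printing rather than executing keeps the decision reviewable before anything changes on the device. A hold placed this way stays in effect until it is lifted, for example with snap refresh --unhold snapd once 2.74.1 is available.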
Testing the fix
Once the snapd snap is rebuilt with the secboot fix, you can install it from the latest/edge channel.
If you need to check whether your system is affected by the bug, you should be able to do so by triggering a refresh to snapd 2.73 where snapd and the kernel are known to be in the same “lane”. You must ensure that both snapd and the kernel are actually updated during the refresh, so make sure snapd is still 2.72 prior to attempting it. Again, it is safest to put a refresh hold on snapd and wait until 2.74.1 is released. However, if you must test and are willing to risk making the device unrecoverable, then:
- If you are using validation sets, then you can ensure that snapd and the kernel are in the same lane by enforcing a new validation set, such as by snap refresh --validate --enforce
  - (Auto-)Refreshes of existing validation sets will not necessarily be in the same lane
- If you are not using validation sets, then you can refresh snapd and the kernel together via
snap refresh --transaction=all-snaps snapd pc-kernel
If this refresh (to snapd 2.73) is successful and the system reboots successfully, then your device is unaffected. If the system reboots, fails to unlock, and reverts snapd and the kernel to the previous revisions, then the device is likely affected, and you should put a refresh hold on snapd and wait for snapd 2.74.1 to be released.
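One way to make the outcome easier to read is to record the snapd and kernel revisions before the test refresh and compare them after the reboot. The sketch below assumes the kernel snap is named pc-kernel, as in the command above, and that snap list prints its usual Name, Version, Rev columns; the helper names and result labels are made up for illustration.

```shell
#!/bin/sh
# Reduce `snap list` output (on stdin) to "name=revision" lines.
# Assumed columns: Name Version Rev Tracking Publisher Notes.
snap_revs() {
    awk 'NR > 1 { print $1 "=" $3 }'
}

# Compare two snapshot files produced by snap_revs:
#   $1 = taken before the refresh, $2 = taken after the reboot.
# Prints:
#   refreshed  every listed snap has a new revision (device appears unaffected)
#   reverted   no revision changed (refresh rolled back, or it never ran)
#   partial    only some snaps changed, so the test is inconclusive
classify_result() {
    # Count lines of the "after" snapshot that are not in "before".
    changed=$(grep -c -v -x -F -f "$1" "$2" || true)
    total=$(( $(wc -l < "$2") ))
    if [ "$changed" -eq 0 ]; then
        echo reverted
    elif [ "$changed" -eq "$total" ]; then
        echo refreshed
    else
        echo partial
    fi
}

# Usage:
#   snap list snapd pc-kernel | snap_revs > /var/tmp/revs.before
#   ... run the test refresh and let the system reboot ...
#   snap list snapd pc-kernel | snap_revs > /var/tmp/revs.after
#   classify_result /var/tmp/revs.before /var/tmp/revs.after
```

A "reverted" result only points to the bug if the refresh was actually attempted and the system failed to unlock along the way, so it is worth confirming that in the output of snap changes as well.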
(thanks to @valentind @oac and @gairepravesh who contributed to investigating and fixing the bug, and this explanation)