Rationale
We want to add a “Repair” mechanism for Ubuntu Core devices to deal with extraordinary situations that require an out-of-band emergency update when the regular snapd update mechanism is not working for whatever reason (a previous bad update, an incompatible combination of software, or something we cannot foresee today).
Because of the powerful nature of this feature, we created a very deliberate design to ensure this mechanism is secure, resilient, effective and transparent.
Running repairs
What repairs to run
The snap-repair code will retrieve the repairs to run, distributed as assertions, in sequence, fetching and executing one at a time.
When running snap-repair after booting for the first time, a device will start from the beginning of the sequence.
In the future we don’t want devices to download and run all repairs that ever existed, many of which would not be relevant anymore. We expect snap-repair to grow mechanisms to decide a starting point in the sequence, combining information from the image and possibly querying a service. Given that repairs are assertions that can be revisioned and updated, we can postpone detailing this starting-point mechanism: if needed, the first repair in the sequence (which will be a NOP/testing repair) can be used to control this for images carrying the first iteration(s) of snap-repair.
When run, the repair script must echo one of the following states to $SNAP_REPAIR_STATUS_FD: done or retry. The retry state is important because the repair may act on a system that does not yet need a repair. E.g. core breaks in r100 but the assertion is already downloaded while core is at r99; we must ensure we re-run the script until r100 is reached.
The snap-repair run infrastructure exposes a repair helper (likely a symlink on PATH back to snap-repair) to help with those details:

repair done
repair retry

and, to control skipping over parts of the sequence:

repair skip ID

If a repair script finishes without having emitted its state, it will be assumed to be retry.
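As an illustrative sketch only, a repair script for the r99/r100 example above could look like the following; the revision check and the fix itself are hypothetical, while the repair helper and $SNAP_REPAIR_STATUS_FD come from the run infrastructure described here:

#!/bin/sh
# Sketch: only act once the (hypothetical) broken core revision r100 is
# actually installed; the "repair" helper reports our state to
# $SNAP_REPAIR_STATUS_FD on our behalf.
set -e

# /snap/core/current is a symlink pointing at the active core revision
current="$(readlink /snap/core/current)"

if [ "$current" != "100" ]; then
    # The broken revision is not on this device yet; ask to be re-run later.
    repair retry
    exit 0
fi

# ... idempotent fix for the breakage in r100 goes here ...

repair done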
We also plan a mechanism such that repairs can be presented to a device using a USB stick.
When to run repairs
We will run the repair fetcher/runner every 4h+random(4h) via a systemd timer unit. All new assertions, and ones still in retry state, will be run and their states updated.
We also plan to run repairs once per boot early (from initrd even if possible).
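As a sketch of the scheduling piece (the unit name and exact options are assumptions; only the 4h+random(4h) cadence comes from the plan above), the timer unit could look roughly like:

# snap-repair.timer (hypothetical unit name), triggering a matching
# snap-repair.service that runs the repair fetcher/runner
[Unit]
Description=Periodically fetch and run repair assertions

[Timer]
OnStartupSec=4h
OnUnitActiveSec=4h
RandomizedDelaySec=4h

[Install]
WantedBy=timers.target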
Assertion
We add a new assertion called “repair”. The primary key of the assertion is (brand-id, repair-id). The repair-id is initially defined as an increasing number starting from 1.
In order to fetch a repair assertion in the sequence, snap-repair will do a GET on an HTTP repair URL that takes the same form as the assertion endpoints.
The current version of the mechanism will consider one sequence with brand-id canonical, useful to repair any Core device. It’s easy to extend this to also consider per-brand sequences, and later possibly model-specific sequences (by extending the repair-id format and the fetch and run logic).
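As an illustration only, fetching the next repair in the canonical sequence could look like the sketch below; the base URL and path layout are assumptions (the text above only says the endpoint takes the same form as the assertion endpoints), and in practice snap-repair does this internally:

# Hypothetical endpoint form: <service>/v1/assertions/repair/<brand-id>/<repair-id>
BRAND_ID=canonical
REPAIR_ID=1
curl --silent --fail \
  --header "Accept: application/x.ubuntu.assertion" \
  "https://assertions.example.com/v1/assertions/repair/${BRAND_ID}/${REPAIR_ID}"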
summary is mandatory and should concisely document what the repair addresses.
A repair assertion contains no since/until headers because we cannot trust the system clock; the timestamp header is just for reference about when the repair was created. The code that is run via the assertion should be as minimal as possible and just enough to make the regular snapd update mechanism work again. It also needs to be idempotent and, typically, to check whether the problem is present at all or could instead occur later (e.g. a broken update that is yet to arrive). The assertion also contains optional lists of targeted series, architectures and models, where an omitted list means any. The run mechanism will use these lists to decide whether the repair should be run at all for the device.
There’s also an optional disabled boolean header used to mark fully retired or known-to-be-broken repairs.
Example:
type: repair
authority-id: acme
brand-id: acme
repair-id: 42
summary: this fixes everything
architectures:
  - amd64
series:
  - 16
models:
  - acme/frobinator
  - acme/hal-10*
timestamp: 2017-06-19T09:13:05Z
body-length: 432
sign-key-sha3-384: Jv8_JiHiIzJVcO9M55pPdqSDWUvuhfDIBJUS-3VW7F_idjix7Ffn5qMxB21ZQuij

#!/bin/sh
set -e
echo "Unpack embedded binary data"
match=$(grep --text --line-number '^PAYLOAD:$' $0 | cut -d ':' -f 1)
payload_start=$((match + 1))
tail -n +$payload_start $0 | uudecode | tar -xzf -
# run embedded content
./fixup
exit 0
# payload generated with, may contain binary data
# printf '#!/bin/sh\necho hello from the inside\n' > hello
# chmod +x hello
# tar czvf - hello | uuencode --base64 -
PAYLOAD:
begin-base64 644 -
H4sIAJl991gAA+3SSwrCMBSF4Yy7iisuoAkxyXp8RBOoDTR1/6Y6EQQdFRH+
b3IG9wzO4KY4DEWtSzfBuSVNcPo1n3ZOGdsqPjhrW89o570SvfKuh1ud95OI
ipcyfup9u/+p7aY/5LGvqYvHVCQt7yDnqVxlTlHyWPMpdr8eCQAAAAAAAAAA
AAAAAAB4cwdxEVGzACgAAA==
====

AXNpZw==
Implementation notes
There are some key properties we ensured:
- secure - we ensure the security of this feature by using assertions as the mechanism to implement it. The use of signatures ensures we have confidence that only legitimate repair assertions are allowed.
- effective - we use the body to include a script that is run as the repair action. The content will be written to disk (or tmpfs in the future in case the disk is full) and executed. This way we can ship easy shell (or perl/python) based fixes. But it also allows us to ship binaries by just embedding them into the script via base64 encoding.
- transparent - when a repair runs we add information to syslog about it. In addition, for each of the repairs we create a directory /var/lib/snapd/repair/run/${BRAND_ID}/${REPAIR_ID}/ and put the following files in there:
  - r{assertion revision}.script: the repair scripts that were run
  - r{assertion revision}.{done|retry|skip}: the full output of the scripts run, with the outcome status indicated by the file extension
Additionally, /var/lib/snapd/repair/assertions/${BRAND_ID}/${REPAIR_ID}/r{assertion revision}.repair will contain the full repair assertion together with the auxiliary assertions as fetched in a stream.
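For illustration, with hypothetical values (brand-id canonical, repair-id 1, assertion revision 0, and a script that ended in the retry state), the layout described above would look roughly like:

/var/lib/snapd/repair/assertions/canonical/1/r0.repair
/var/lib/snapd/repair/run/canonical/1/r0.script
/var/lib/snapd/repair/run/canonical/1/r0.retry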