Repair capability (emergency fixes)

mvo

#1

Rationale

We want to add a “Repair” mechanism for Ubuntu Core (all-snap) devices to deal with extraordinary situations that require an out-of-band emergency update when the regular snapd update mechanism is not working for whatever reason (such as a previous bad update, an incompatible combination of software, or something that we cannot foresee now).

Because of the powerful nature of this feature we need a very clear design to ensure this mechanism is secure, resilient, effective and transparent.

Running repairs

What repairs to run

The snap-repair code will retrieve the repairs, distributed as assertions, to run in sequence, retrieving and executing one at a time.

When running snap-repair after booting for the first time, a device will start from the beginning of the sequence.
In the future we don’t want devices to download and run all repairs that ever existed, many of which would not be relevant anymore. We expect snap-repair to grow mechanisms to decide a starting point in the sequence, combining information from the image and possibly querying a service. Because repairs are assertions that can be revisioned and updated, we can postpone detailing this starting-point mechanism: if needed, the first repair in the sequence, which will be a NOP/testing repair, can be used to control this for images with the first iteration(s) of snap-repair.

When run, the repair must echo one of the following states to $SNAP_REPAIR_STATUS_FD: done or retry. The retry state is important because the repair may act on a system that does not yet need a repair. E.g. core breaks in r100 but the assertion is already downloaded when core is at r99; we must ensure we re-run the script until r100 is reached.

The snap-repair run infrastructure will expose a repair helper (likely a symlink on PATH back to snap-repair) to help with those details:

  • repair done
  • repair retry
  • to control skipping over parts of the sequence: repair skip ID

If a repair script finishes without having emitted its state it will be assumed to be retry.
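
A minimal sketch of how a repair script might report its state, assuming the runner exports SNAP_REPAIR_STATUS_FD as the number of an already open file descriptor. The function names emit_state and repair_core and their revision arguments are hypothetical and only illustrate the r99/r100 case above.

```sh
#!/bin/sh
# Hypothetical helper: write one state word to the descriptor whose
# number is in $SNAP_REPAIR_STATUS_FD (assumed to be opened by the runner).
emit_state() {
    echo "$1" >&"$SNAP_REPAIR_STATUS_FD"
}

# Hypothetical repair logic: the device is at revision $1 and the
# breakage only appears from revision $2 onwards.
repair_core() {
    current_rev=$1
    broken_rev=$2
    if [ "$current_rev" -lt "$broken_rev" ]; then
        # Nothing to fix yet; ask to be run again later.
        emit_state retry
        return 0
    fi
    # ... the actual fix would go here ...
    emit_state done
}
```

With this shape, a device still at r99 reports retry and is revisited on a later run, while a device at r100 applies the fix and reports done.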

We also want a mechanism such that repairs can be presented to a device using a USB stick.

When to run repairs

We will run the repair fetcher/runner every 4h+random(4h) via a systemd timer unit. All repairs that are new or in retry state will be run and their states updated.

Ideally we also want to run repairs once early in each boot (even from the initrd, if possible).

Assertion

We add a new assertion called “repair”. The primary key of the assertion is (brand-id, repair-id). The repair-id is initially defined as an increasing number starting from 1.

In order to fetch a repair assertion in the sequence, snap-repair will do a GET on an HTTP repair URL that takes the same form as the assertion endpoints.

The very first iteration of the mechanism will consider one sequence with brand-id canonical, useful to repair any Core device. It’s easy to extend this to also consider per-brand sequences, and later possibly model-specific sequences (by extending the repair-id format and the fetch and run logic).

summary is mandatory and should concisely document what the repair addresses.

A repair assertion contains no since/until header because we cannot trust the system clock. The timestamp header is just a reference for when the repair was created. The code that is run via the assertion should be as minimal as possible, doing just enough to make the regular snapd update mechanism work again. It also needs to be idempotent, and when the problem is not present it should typically check whether the problem could still occur later (e.g. a broken update that is likely yet to come). The assertion also contains optional lists of targeted series, architectures and models, where an omitted list means any. The run mechanism will use these lists to decide whether the repair should be run at all for the device.

There’s also an optional disabled boolean header used to mark fully retired or known-to-be-broken repairs.

Example:

type: repair
authority-id: acme
brand-id: acme
repair-id: 42
summary: this fixes everything
architectures: 
  - amd64
series:
  - 16
models:
  - acme/frobinator
  - acme/hal-10*
timestamp: 2017-06-19T09:13:05Z
body-length: 432
sign-key-sha3-384: Jv8_JiHiIzJVcO9M55pPdqSDWUvuhfDIBJUS-3VW7F_idjix7Ffn5qMxB21ZQuij

#!/bin/sh
set -e
echo "Unpack embedded binary data"
match=$(grep --text --line-number '^PAYLOAD:$' $0 | cut -d ':' -f 1)
payload_start=$((match + 1))
tail -n +$payload_start $0 | uudecode | tar -xzf -
# run embedded content
./hello
exit 0
# payload generated with, may contain binary data
#   printf '#!/bin/sh\necho hello from the inside\n' > hello
#   chmod +x hello
#   tar czvf - hello | uuencode --base64 -
PAYLOAD:
begin-base64 644 -
H4sIAJl991gAA+3SSwrCMBSF4Yy7iisuoAkxyXp8RBOoDTR1/6Y6EQQdFRH+
b3IG9wzO4KY4DEWtSzfBuSVNcPo1n3ZOGdsqPjhrW89o570SvfKuh1ud95OI
ipcyfup9u/+p7aY/5LGvqYvHVCQt7yDnqVxlTlHyWPMpdr8eCQAAAAAAAAAA
AAAAAAB4cwdxEVGzACgAAA==

AXNpZw==
====
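
The targeting lists in the example above could be evaluated roughly as follows. This is only a sketch: the function name matches_list is hypothetical, and treating entries such as acme/hal-10* as shell globs is an assumption drawn from the example, not something the spec settles.

```sh
#!/bin/sh
# matches_list VALUE [PATTERN...]: succeed if no patterns are given
# (an omitted list means "any") or if VALUE matches one of them.
# Patterns are treated as shell globs (an assumption, see above).
matches_list() {
    value=$1
    shift
    [ "$#" -eq 0 ] && return 0
    for pattern in "$@"; do
        # Leaving $pattern unquoted lets case perform glob matching.
        case $value in
            $pattern) return 0 ;;
        esac
    done
    return 1
}
```

A runner would call this once per list, for instance `matches_list "acme/hal-1000" "acme/frobinator" "acme/hal-10*"` for the models header, and only run the repair when the series, architectures and models checks all succeed.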

Straw-man for the implementation

There are some key properties we want to ensure:

  • secure - we ensure the security of this feature by using assertions as the mechanism to implement it. The use of signatures ensures that only legitimate repair assertions are accepted. Things to consider:

  • being able to revoke repair assertions via a revoke-repair assertion (alternatively we just publish a new revision of the existing assertion) to ensure that a repair assertion with bad code can not be used to attack.

  • Limit the authority who can issue repair assertions to Canonical initially (to ensure the system is not abused for things that are not the job of the repair assertion)

  • resilient - TBD

  • effective - we use the body to include a script that is run as the repair action. The content will be written to disk/tmpfs (in case the disk is full) and executed. This way we can ship simple shell (or perl/python) based fixes. It also allows us to ship binaries by embedding them into the script via base64 encoding. An example will be included in the tests. We will also need to make sure that we handle big payloads, i.e. ensure that the assertion system can deal with multi-megabyte lines without choking. In addition we should send the output and error result of the script back to a repair-tracker (similar to our error tracker) to ensure that we can detect failing repair actions and act accordingly. In phase 1 we might consider using the error tracker for this and only monitor failing actions.

  • transparent - when a repair runs we add information to syslog about it. In addition, for each of the repairs we create a directory /var/lib/snapd/repair/run/${BRAND_ID}/${REPAIR_ID}/ and put the following files in there:

    • r{assertion revision}.script: the repair scripts that were run
    • r{assertion revision}.{done|retry|skip}: the full output of the scripts run, with the outcome status indicated by the file extension

    OTOH /var/lib/snapd/repair/assertions/${BRAND_ID}/${REPAIR_ID}/r{assertion revision}.repair will contain the full
    repair assertion together with the auxiliary assertions as fetched in a stream.
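
A rough sketch of how a runner could produce this run directory layout, assuming the script is executed with sh and descriptor 3 serves as the status channel. The function name run_and_record, its arguments and the temporary r<revision>.output file are hypothetical; BASE stands in for /var/lib/snapd/repair/run.

```sh
#!/bin/sh
# run_and_record BASE BRAND_ID REPAIR_ID REVISION SCRIPT
# Copies SCRIPT to BASE/BRAND_ID/REPAIR_ID/rREVISION.script, runs it,
# stores its combined output in rREVISION.<state> and prints the state.
run_and_record() {
    dir="$1/$2/$3"
    rev=$4
    script=$5
    mkdir -p "$dir" || return 1
    cp "$script" "$dir/r$rev.script" || return 1
    status_file=$(mktemp) || return 1
    # stdout and stderr go to the output file, descriptor 3 to the
    # status file; a failing script is tolerated and judged by its state.
    SNAP_REPAIR_STATUS_FD=3 sh "$dir/r$rev.script" \
        >"$dir/r$rev.output" 2>&1 3>"$status_file" || true
    state=$(head -n 1 "$status_file")
    rm -f "$status_file"
    # A script that emitted no (or an unknown) state counts as retry.
    case $state in
        retry|done|skip) ;;
        *) state=retry ;;
    esac
    mv "$dir/r$rev.output" "$dir/r$rev.$state"
    echo "$state"
}
```

Keeping the executed script next to its output means the exact logic that ran on a device stays inspectable even after the assertion gets a newer revision.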


#2

Thanks for getting this feature started, Michael.

Upon reading the details above, I got a feeling that “emergency” might not be the best term for the feature, because it implies a sense of urgency that is in fact not there. While we do intend for the fixes to be applied in a timely manner, this is not about a major flaw that is compromising the system and needs to keep people awake. It’s rather something important that needs to be applied at the first chance to put the system back in proper order. As a proposal, it sounds like the term repair would be more in line for the assertion and material around the feature. Besides the obvious meaning, it’s also nice because it contains the idea of “re-pairing”, which is the intention here: re-establish the proper flow of communication after something unexpected happened.

In terms of the ID, I suggest going for a more minimalist ID-based approach, and we can create a convention of calling those REPAIR-42, or similar. This also creates a more natural ordering, which may be useful depending on how we decide the succession of repairs should take place.

About that until field proposal, my gut feeling is that it is suspect. I can’t think of a reason why a repair would have a defined time to run out. Either we want the fix to be made, or we don’t. If the idea is being able to disable it, perhaps a more explicit bool field (disabled?) would be better.

The four points you made are spot on, so here are some further ideas about each of them:

  • secure – This is an easy one for us to sort out at this stage because assertions already give us much of what we need.

  • resilient – This is a trickier one. The mechanism needs to be cooked in such a way that it will tend to work even when everything else is falling apart, and anticipating what will go wrong in the future without information from the past is non-trivial. A few initial ideas:

    • Tool that refreshes those assertions should be statically built and run out of band regularly via a timer
    • Same tool should also be run once during boot so it can repair systemd itself
    • Ideally the tool would also work through the automatic mechanism that we have in place for importing assertions from USB sticks, so that a problem that compromises the network can be recovered from.
    • Design for it should be minimalist, preferably with much simpler and to-the-point versions of the stock logic for assertions and whatever else is needed, so breakages in those sensitive areas may be fixed
    • For those reasons, the tool needs to use its own minimalist storage space to control its own operation.
    • We need to have very good test coverage, to protect against problems in the compiler and building process in general, since this tool and underlying mechanism will be rarely used.
  • effective – On this front, we need to ensure that the mechanism enables us to perform the necessary fix. Some ideas around that:

    • We definitely need to be able to at least run a script from the assertion, but even that may turn out to be too restrictive if we consider problems such as broken network drivers. We may need to ship a whole new driver to restore the machine’s well-being, and that obviously can’t be done via the network itself. So perhaps the body of the assertion should be a base64-encoded compressed tarball which has an entry command with a well-known name (repair?).
    • We might support both: either a script starting with #! and an arbitrary interpreter, or a tarball with a repair command. This gives us better visibility when possible, which I expect to be the more common case (on a rarely seen event), and it still keeps our get-out-of-jail card for those even rarer circumstances.
  • transparent – This is an easy one to get right as well, I think. We definitely want to expose and keep track whenever the mechanism is used, so people know exactly which procedures have run on their devices, when, and what happened when that took place. That includes:

    • Proper logging into the usual places while the repair is taking place.
    • A command to list all repairs ever applied (or perhaps applied in the last year?), including the ability to obtain stdout/stderr generated while applying it. We need to watch out for concurrency issues here, since the tool will have its own storage space, and for obvious reasons we don’t want to lock the repair tool out of its own space due to a bug elsewhere.
    • We need to be careful with revisioning around those assertions. I think we want to allow updating the assertion to improve its logic while we’re addressing the problem, but at the same time we don’t want to have the new revision applied on devices that have already run an old revision, nor do we want to hide the actual revision applied since that contains the logic that has run on that particular device.
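
One way to keep a new revision from being applied on a device that already completed an older one, sketched under the assumption that every executed revision leaves an r<revision>.<state> file in a per-repair run directory (as in the strawman of the first post). The function name already_done is hypothetical.

```sh
#!/bin/sh
# already_done RUN_DIR: succeed if any revision recorded in RUN_DIR
# finished in the done state.
already_done() {
    for f in "$1"/r*.done; do
        # If the glob matched nothing, $f is the literal pattern and
        # the existence check fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

A runner could consult this before executing a freshly fetched revision, while still leaving the older r<revision>.script in place so the logic that actually ran stays visible.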

How does that sound?

I’ll keep updating those notes as I think of more details.


#3

I agree that since/until don’t make a lot of sense here, also because one issue might be that the clock of the device is totally off (something to think about in the verification bit as well).

What we need is probably a way in each snapd/core release to maybe declare which repairs are old, not relevant anymore.

Unless we can convince ourselves that there will be at most one repair active? (the issue there is whether devices getting out of factories seriously broken and sitting on shelves for a long while is a real concern)

I fear we might get this one wrong if we don’t spend some time thinking exactly how they will be found/delivered.

I think we probably need a way to filter by arch also?


#4

Thanks @pedronis! I think you make a very important point about since/until, I removed it from the strawman now.

I also agree that we need to think about how it is delivered/found. We could embed the build date of the snap-repairer into the binary and add a since header to the repair assertions. Then we can compare the snap-repairer build date and the assertion date, and if the assertion is older we ignore it (this way we avoid depending on the system clock). The assumption is that if snap-repair gets updated to date X it means that the system was fine until at least date X.

Alternatively we could rely on the script itself to do defensive checks. I.e. if we need to replace a bad network driver, the script would have to contain an “md5sum” of the driver to see if it needs replacing. Defensive programming for these kinds of repair scripts is important anyway. But having an additional safety net to skip clearly unneeded repairs would be much better.
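
Such a defensive check might look like the sketch below. The function name needs_replacing is hypothetical, and the file path and checksum are supplied by the caller rather than taken from any real driver.

```sh
#!/bin/sh
# needs_replacing FILE BAD_MD5: succeed only if FILE exists and its MD5
# checksum equals BAD_MD5, i.e. it is the known-bad version.
needs_replacing() {
    [ -f "$1" ] || return 1
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}
```

A repair script would only swap the file when this succeeds, which also keeps the script idempotent: once the good driver is in place the checksum no longer matches and nothing is touched.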


#5

because of all the various distributions I don’t think we can use build dates reliably. the naive thinking is that repair ids are progressive so we would just make sure to track the last previous one in the codebase. the hope is also still that this will be very few at least for a while.


#6

and yes I think the repair scripts need to be built defensively/ be idempotent either by pre-checks or how they are written, so running them again should never be a problem, otoh we don’t want all devices to download and run all the scripts all the time.


#7

@mvo was there discussion whether we will use this only on core? or also on classic?


#8

I think it makes a lot of sense to limit this to core only (at least for now). The classic systems all have another way (the native packaging system) to pull in updates.


#9

One remark about the clock being potentially completely off on a device… This will almost certainly result in an HTTPS error immediately, shutting down any communication channel. We’ve seen this on the phone.


#10

Thanks for raising this @pstolowski - given that assertions contain the crypto we need we could simply use http for the network queries.


#11

Do we expect the repair mechanism to also be able to patch state.json? IMHO that would be desirable in the event that a bug results in semi-corrupted state which affects future operations of snapd. If the answer to this is yes, then perhaps we should think about how we could apply such fixes to JSON data.


#12

Just a quick thought. We should add a mechanism to ensure that the repair doesn’t run too early. E.g. if we know that release 2.100 has a bug that we fixed in 2.101 and issued a repair assertion for, and someone gets the assertion on version 2.99 (where it does nothing because the device is not broken) and then for whatever reason updates to 2.100, the repair assertion should run at that point.


#13

the alternative might be to re-run things instead, basically when a new snapd gets installed it would re-run all repairs that it considers still relevant even if they were run previously,

if we have a mechanism to know what is not relevant/too old, and they are idempotent that should work too,

a different related issue is that sometimes a repair needs to be able to either repair or prevent, depending on whether the breakage is immediate or only happens after some snapd/combination of things has run for a while


#14

In the PR for this the discussion about the primary key came up. To allow opening this up later we should add brand-id to the assertion and make the primary key (brand-id, repair-id).


#15

@mvo - would model be more appropriate than brand? i.e. you only want the repair to be applied to a matching model.


#16

@pstolowski The repair mechanism is open ended, precisely because we can’t anticipate what will break. That certainly includes daemon state.

@zyga @pedronis Good points. Perhaps we need to run every repair in order and have each one report whether it’s done or whether it wants to be run again later. A bit like our tasks today. We also need to drop the serialization requirement in that case, other than serialization of actual execution, as we may need to fix other things while we still have repairs active.

In that case we probably want repair-id to be just an integer, as each brand should be able to coordinate the repairs internally.

@noise Perhaps… it might make handling multiple models affected by the same problem a bit cumbersome, but let’s consider it further.


#17

One suggestion from @niemeyer was to make the script return a more meaningful status. Like (fixed-stuff, did-not-find-anything-to-fix, failed). This way we can just rerun things that are in “did-not-find-anything-to-fix” state.


#18

yes, though I expect some cases will have trouble distinguishing fixed vs did-not-find-anything-to-fix unless they look at versions of stuff, which might be brittle


#19

We probably don’t need to worry too much about that upfront, since the scripts can do whatever they want and can also be updated with new revisions of the assertion.


#20

@mvo @niemeyer btw did we get a sense of how often partners would reach to use this (almost never like us?, couple of times a year per model?..) ? I imagine it should still be a repair mechanism for them too, not a generic remote execution thing, that’s something for a full management solution or something they could roll with their own snaps they ship with the device