Repair capability (emergency fixes)

I’m not sure, to be honest. It seems hard to draw a line once we hand the feature off. It’s been described to me by interested people as something they’d use for fixing up systems and debugging a certain class of problems. We probably shouldn’t assume much in terms of frequency; a worst case is probably something like once a week, so ~50 a year.

Some suggestions:

  • run-me-again → retry?
  • done-permanently → done?

Also, what’s the implied meaning of a plain exit, with no status written? done, or retry? Perhaps retry, to be on the safe side?

Another detail: we should clearly specify what the protocol on that file descriptor is, so that we may choose to send commands other than final statuses, such as logging. For that we’ll need a clear record delimiter.
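To make that concrete, here is a minimal sketch of what such a protocol could look like. Everything here is an assumption, not a decision: the descriptor number arriving in a SNAP_REPAIR_STATUS_FD environment variable, the newline as record delimiter, and the REPAIR_ALREADY_FIXED stand-in for a real check.

```shell
#!/bin/sh
# Hypothetical repair script. Assumptions (not decided yet): the status
# descriptor number arrives in SNAP_REPAIR_STATUS_FD, each record is one
# newline-terminated line, and "log: ..." records share the channel with
# the final "done"/"retry" status.
run_repair() {
    fd="${SNAP_REPAIR_STATUS_FD:-1}"   # fall back to stdout for this demo
    echo "log: checking whether the fix is still needed" >&"$fd"
    if [ "${REPAIR_ALREADY_FIXED:-no}" = yes ]; then
        echo "done" >&"$fd"    # final status: never run this repair again
    else
        echo "retry" >&"$fd"   # final status: run it again on the next pass
    fi
}
run_repair
```

With one record per line, a reader on the other end can stream-process log records and still unambiguously pick out the final status as the last record.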

Then, how about having a repair tool in the $PATH, so that one may run repair retry, etc., instead of fiddling with the file descriptor? We may use the symlink trick, pointing it back to snap-repair itself, so that we don’t have another moving part to worry about.
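For illustration, the symlink trick could look roughly like this; the dispatch-on-invocation-name pattern is the point, while the names, paths and messages are made up for the sketch:

```shell
#!/bin/sh
# Hypothetical sketch: "repair" is installed as a symlink to snap-repair,
# e.g. ln -s /usr/lib/snapd/snap-repair /usr/bin/repair, and the tool
# dispatches on the name it was invoked under.
dispatch() {
    name=$1; shift
    case "$name" in
        repair)
            # invoked via the symlink: forward the given status to the
            # status channel instead of running the full fetch/run cycle
            echo "forwarding status: $1"
            ;;
        *)
            echo "running full snap-repair cycle"
            ;;
    esac
}
dispatch "$(basename "$0")" "$@"
```

Since both names resolve to the same binary, there is only one artifact to ship and update.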

We should probably do that, but in addition to rather than instead of. In other words, running it every 4 hours via systemd is a good idea anyway, and it’d be good to have a run during boot which does not depend on systemd, so we can still fix things if something more serious goes wrong.
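The systemd side of that could be a plain timer unit along these lines; a sketch only, with placeholder unit names and intervals taken from the 4-hour figure above, not whatever snapd will actually ship:

```
# snap-repair.timer (hypothetical)
[Unit]
Description=Periodically run snap-repair

[Timer]
OnStartupSec=15min
OnUnitActiveSec=4h

[Install]
WantedBy=timers.target
```

The boot-time run that must not depend on systemd would live elsewhere, as discussed above.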

This sounds more implicit and magic than the first proposal, which probably means it’s more error-prone as well.

Instead of having the architecture as part of the key, I suggest having it as a field that is a list, and doing the same with the model. If either is omitted it means all.
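As a purely illustrative sketch of that shape (field names and values made up for the example, not final):

```
type: repair
brand-id: acme
repair-id: 7
architectures:
  - amd64
  - armhf
models:
  - acme/router-x1
```

Omitting either list would mean the repair applies to all architectures, or all models, respectively.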

I don’t understand this part. It sounds like the repair-id should be [0-9]+ alone?

We need to think this through a bit more. A broken system that manages to update to a newer core won’t necessarily mean it’s not broken anymore. The idea of booting into a given core is also unclear when our repair system is actually not living inside snapd proper.

The underlying idea of having a way to define a good base is good, though. We just need to understand a bit better how to convey this information. Also, the core snap is not relevant for arbitrary brands. The system we end up with should allow the base repair to be provided for the model brand too.

Yes, a new revision seems saner. One thing we need to design for is this: we’re now saying repair assertions may run multiple times, and we’re also saying that they may be updated. That means a single assertion may end up running multiple times, under different revisions. We need to enable people to see the content and logs of each and every revision that was ever run, because we don’t want an ill-meaning entity to be able to hide the fact that an older revision of the assertion did something nasty and was then updated again to mask the initial behavior.

Needs to address comment above.

Overall plan is going in a great direction, thanks!

the issue with that is that if we have to emit the same repair (but with different binaries per arch), those become successive repair-ids, each one not run by the other architectures

it seems the typical cases are:

  • one repair is a shell script or something else that works for all architectures
  • a repair is architecture specific (this works well with both approaches)
  • we have a repair that is the same for all archs but needs a complicated binary, so now we consume one repair-id per arch; that doesn’t seem great to me

That’s a good point, but it also seems like a more general issue. Any clients which are not affected by the problem will be downloading binaries that they won’t be making use of. Perhaps we should look into a more general solution that can better filter repairs at query time?

so far we don’t need much intelligence on the server (a hierarchy of dirs under apache would work, for example); if we need real querying, we need to see how/when that can happen; it’s also more fragile, judging from the experience with snap metadata.

an in-between approach would be some sort of index file(s) on the server; we would need to think about what the trust model for that is

We did discuss being able to query by arbitrary headers. Was that local only, or has that conversation ever reached the server?

I can also imagine something specific for this particular case. We might introduce a simple language that would allow, for example, defining that a given repair should only be used if some safe expression matches on the local system. But that sounds like something for later.

first of all, because these could be large (tens of MBs), there is some discussion whether they would be served through the general assertion service at all.

there is some querying functionality in the server, but it is not open in general and is used only internally between services; atm external clients can only get single assertions

@niemeyer after further thinking I agree that we should remove arch from the primary key.

we really want gapless sequences of repairs coming

  1. from us for all devices
  2. optionally one for each brand for all their devices

the repair-id “brand_model-#” idea was to allow for a third set of sequences targeting exactly the specific device models of a brand.

Whether for each gapless sequence we want to do

GET repair-N
GET repair-N+1
GET repair-N+2

(getting through some irrelevant ones along the way)

or do queries? (but the current query capability we have, even if opened up, doesn’t fit what we need here, I think)

is a different matter;
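the first option, sketched in shell below; fetch_repair is a fake stand-in that pretends a sequence of three repairs exists, whereas a real client would GET the assertion and stop when the server reports there is no next one:

```shell
#!/bin/sh
# Hypothetical sketch of walking a gapless repair sequence: keep asking
# for the next repair-id until there are no more. fetch_repair is a fake
# stand-in for an HTTP GET that fails (e.g. 404) past the end.
fetch_repair() {
    [ "$1" -le 3 ] && echo "repair-$1"
}
walk_sequence() {
    n=$1
    while r=$(fetch_repair "$n") && [ -n "$r" ]; do
        echo "got $r"
        n=$((n + 1))
    done
}
walk_sequence 1
```

The gapless property is what makes this naive walk safe: the client can stop at the first missing id without wondering whether later ids exist.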

notice that queries are also tricky because we either get a stream with the bodies included for many assertions, which is fragile to fetch and retry on,

or we could do a very simple query to get the headers only first, and then do GETs of the full assertions for the ones we really care about.

given that repairs can now tell us whether to retry them or not, the issue of having a repair-id cutoff, such that repairs older than it are not relevant and don’t need to run, is less pressing,

though we still don’t want images at first boot to download all repairs that ever existed (though this also shouldn’t be a problem for a while), so we will need a way to have conservative starting points for this that come with images (through main snaps (core, gadget…), config or some seed data)

btw, we had at some point the idea of having an expiry on repairs; for fixing stuff we agreed it doesn’t make sense, but for the debugging use case it still might

afternoon walk thought: if we either

  1. go through (download+execute+decide what to download next) repairs in a sequence one at a time
  2. or have an immediate flag on repairs that means “execute me now before downloading further from the sequence”

then we can probably postpone this problem until we understand its nature better, by at least implementing

  • repair skip-to [--brand=BRAND] ID
  • repair rewind [--brand=BRAND] ID

(strawman syntax), which might make sense anyway, basically giving ourselves the tools to control how to navigate the sequences from the repairs themselves (if needed)

@mvo @niemeyer I have updated the topic wiki to reflect what I think is the current thinking for the very first iteration, and to simplify things.

this is the current thinking for the first implementation about:

How to retrieve each repair BRAND-ID/REPAIR-ID, HTTPS vs HTTP

  • Try to retrieve the headers only (as JSON) over HTTPS at:

    filter on whether it’s applicable or not (recording the information and the decision if not)

  • If applicable retrieve and verify the full repair (as application/x.ubuntu.assertion) also over HTTPS

When doing HTTPS, use max(sys-time, time-lower-bound) as the time for verifying certificates (at least in the case where we got an error about the time validity of the cert, i.e. not yet valid).

Where time-lower-bound is obtained by considering the max of:

  • image creation time (timestamp of seed.yaml for example)
  • server-reported time from previous successful HTTPS requests
  • timestamp of valid retrieved repairs
  • possibly time lower bound as kept by snapd itself

If HTTPS still fails (for TLS-related reasons), try again from scratch, retrieving the full repair over HTTP.
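The time-lower-bound part of the flow above can be sketched in shell; the timestamps are made-up example values standing in for the sources listed above, and the real client would do the equivalent in Go with a custom certificate verifier:

```shell
#!/bin/sh
# Hypothetical sketch: compute time-lower-bound as the max of the
# candidate timestamps (epoch seconds), then take max(sys-time,
# time-lower-bound) as the time to verify certificates against.
max_epoch() {
    # stdin: one epoch-seconds value per line; print the largest
    sort -n | tail -n 1
}

image_time=1497052800      # e.g. timestamp of seed.yaml (made-up value)
server_time=1497139200     # from a previous successful HTTPS response
repair_time=1497225600     # timestamp of a valid retrieved repair

time_lower_bound=$(printf '%s\n' "$image_time" "$server_time" "$repair_time" | max_epoch)
sys_time=$(date +%s)
verify_time=$(printf '%s\n' "$sys_time" "$time_lower_bound" | max_epoch)
echo "verify certificates as of: $verify_time"
```

The point of the max() is that a device with a badly wrong (too early) clock still verifies certificates against a time we know the world has already reached.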


opened as well

skeleton of where the actual running could happen:

I also have a WIP branch in good shape about the actual verification of the signature of repairs.

also created a PR with the state initialisation logic, as discussed at the sprint and with @mvo:

also opened since:

check list from London:

  • primary key is brand and number
  • scripts need to call repair done|retry; default status is retry
  • even if assertion was run and marked for retry, look for new rev
  • when process starts, run local ones, then try fetching more #3935
  • “snap repairs” must show all revisions ever run
  • root trusted key inside tool, brand key comes with assertions #3616 #3930
  • only build snap-repair deb when really needed/wanted
  • copy snap-repair to local disk, replace only when necessary after it reports one successful run
    • replace only when digest changes
  • filter on series, arch, model #3787
  • make repair.json built out of seed/assertions w/ model #3571