Repair capability (emergency fixes)

I’m not sure, to be honest. It seems hard to draw a line once we hand the feature off. It’s been described to me by interested people as something they’d use for fixing up systems and debugging a certain class of problems. We probably shouldn’t assume much in terms of frequency; a worst case is probably something like once a week, so ~50 a year.

Some suggestions:

  • run-me-again → retry?
  • done-permanently → done?

Also, what’s the implied meaning of a plain exit, with no status written? done, or retry? Perhaps retry, to be on the safe side?

Another detail: we should clearly specify what the protocol on that file descriptor is, so that we may choose to send commands other than final statuses, such as logging. For that we’ll need a clear record delimiter.
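To make that concrete, here is a minimal sketch of what such a protocol could look like. Everything here is an assumption, not a decision: the descriptor number arriving in a SNAP_REPAIR_STATUS_FD environment variable, the newline as record delimiter, and the REPAIR_ALREADY_FIXED stand-in for a real check.

```shell
#!/bin/sh
# Hypothetical repair script. Assumptions (not decided yet): the status
# descriptor number arrives in SNAP_REPAIR_STATUS_FD, each record is one
# newline-terminated line, and "log: ..." records share the channel with
# the final "done"/"retry" status.
run_repair() {
    fd="${SNAP_REPAIR_STATUS_FD:-1}"   # fall back to stdout for this demo
    echo "log: checking whether the fix is still needed" >&"$fd"
    if [ "${REPAIR_ALREADY_FIXED:-no}" = yes ]; then
        echo "done" >&"$fd"    # final status: never run this repair again
    else
        echo "retry" >&"$fd"   # final status: run it again on the next pass
    fi
}
run_repair
```

With one record per line, a reader on the other end can stream-process log records and still unambiguously pick out the final status as the last record.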

Then, how about having a repair tool in the $PATH, so that one may run repair retry, etc., instead of fiddling with the file descriptor? We may use the symlink trick, pointing it back to snap-repair itself, so that we don’t have another moving part to worry about.
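For illustration, the symlink trick could look roughly like this; the dispatch-on-invocation-name pattern is the point, while the names, paths and messages are made up for the sketch:

```shell
#!/bin/sh
# Hypothetical sketch: "repair" is installed as a symlink to snap-repair,
# e.g. ln -s /usr/lib/snapd/snap-repair /usr/bin/repair, and the tool
# dispatches on the name it was invoked under.
dispatch() {
    name=$1; shift
    case "$name" in
        repair)
            # invoked via the symlink: forward the given status to the
            # status channel instead of running the full fetch/run cycle
            echo "forwarding status: $1"
            ;;
        *)
            echo "running full snap-repair cycle"
            ;;
    esac
}
dispatch "$(basename "$0")" "$@"
```

Since both names resolve to the same binary, there is only one artifact to ship and update.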

We should probably do that, but in addition to rather than instead of. In other words, running it every 4 hours via systemd is a good idea anyway, and it’d be good to have a run during boot which does not depend on systemd, so we can still fix things if something more serious goes wrong.
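The systemd side of that could be a plain timer unit along these lines; a sketch only, with placeholder unit names and intervals taken from the 4-hour figure above, not whatever snapd will actually ship:

```
# snap-repair.timer (hypothetical)
[Unit]
Description=Periodically run snap-repair

[Timer]
OnStartupSec=15min
OnUnitActiveSec=4h

[Install]
WantedBy=timers.target
```

The boot-time run that must not depend on systemd would live elsewhere, as discussed above.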

This sounds more implicit and magic than the first proposal, which probably means it’s more error-prone as well.

Instead of having the architecture as part of the key, I suggest having it as a field that is a list, and doing the same with the model. If either is omitted it means all.
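As a purely illustrative sketch of that shape (field names and values made up for the example, not final):

```
type: repair
brand-id: acme
repair-id: 7
architectures:
  - amd64
  - armhf
models:
  - acme/router-x1
```

Omitting either list would mean the repair applies to all architectures, or all models, respectively.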

I don’t understand this part. It sounds like the repair-id should be [0-9]+ alone?

We need to think this through a bit more. A broken system that manages to update to a newer core won’t necessarily mean it’s not broken anymore. The idea of booting into a given core is also unclear when our repair system is actually not living inside snapd proper.

The underlying idea of having a way to define a good base is good, though. We just need to understand a bit better how to convey this information. Also, the core snap is not relevant for arbitrary brands. The system we end up with should allow the base repair to be provided for the model brand too.

Yes, a new revision seems saner. One thing we need to design for is this: we’re now saying repair assertions may run multiple times, and we’re also saying that they may be updated. That means a single assertion may end up running multiple times, under different revisions. We need to enable people to see the content and logs of each and every revision that was ever run, because we don’t want an ill-meaning entity to be able to hide the fact that an older revision of the assertion did something nasty and was then updated again to mask the initial behavior.

Needs to address comment above.

Overall plan is going in a great direction, thanks!

the issue with that is that if we have to emit the same repair (but with different binaries per arch), those become successive repair-ids, each one not run by the other architectures

it seems the typical cases are:

  • one repair is a shell script or something else that works for all architectures
  • a repair is architecture specific (this works well with both approaches)
  • we have a repair that is the same for all archs but needs a complicated binary, so now we consume one repair-id per arch; that doesn’t seem great to me

That’s a good point, but it also seems like a more general issue. Any clients which are not affected by the problem will be downloading binaries that they won’t be making use of. Perhaps we should look into a more general solution that can better filter repairs at query time?

so far we don’t need much intelligence on the server (a hierarchy of dirs under apache would work, for example); if we need real querying, we need to see how/when that can happen; it’s also more fragile, judging from the experience with snap metadata.

an in-between approach would be some sort of index file(s) on the server; we would need to think about what the trust model for that is

We did discuss being able to query by arbitrary headers. Was that local only, or has that conversation ever reached the server?

I can also imagine something specific for this particular case. We might introduce a simple language that would allow, for example, defining that a given repair should only be used if some safe expression matches on the local system. But that sounds like something for later.

first of all, because these could be large (tens of MBs), there is some discussion whether they would be served through the general assertion service at all.

there is some querying functionality in the server, but it is not open in general and is used only internally between services; atm external clients can only get single assertions

@niemeyer after further thinking I agree that we should remove arch from the primary key.

we really want gapless sequences of repairs coming

  1. from us for all devices
  2. optionally one for each brand for all their devices

the repair-id “brand_model-#” idea was to allow for a third set of sequences targeting exactly the specific device models of a brand.

Whether for each gapless sequence we want to do

GET repair-N
GET repair-N+1
GET repair-N+2

(getting through some irrelevant ones along the way)

or do queries? (but the current query capability we have, even if opened up, doesn’t fit what we need here, I think)

is a different matter;
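the first option, sketched in shell below; fetch_repair is a fake stand-in that pretends a sequence of three repairs exists, whereas a real client would GET the assertion and stop when the server reports there is no next one:

```shell
#!/bin/sh
# Hypothetical sketch of walking a gapless repair sequence: keep asking
# for the next repair-id until there are no more. fetch_repair is a fake
# stand-in for an HTTP GET that fails (e.g. 404) past the end.
fetch_repair() {
    [ "$1" -le 3 ] && echo "repair-$1"
}
walk_sequence() {
    n=$1
    while r=$(fetch_repair "$n") && [ -n "$r" ]; do
        echo "got $r"
        n=$((n + 1))
    done
}
walk_sequence 1
```

The gapless property is what makes this naive walk safe: the client can stop at the first missing id without wondering whether later ids exist.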

notice that queries are also tricky because we either get a stream with the bodies included for many assertions, which is fragile to fetch and retry on,

or we could do a very simple query to get the headers only first, and then do GETs of the full assertions for the ones we really care about.

given that repairs can now tell us whether to retry them or not, the issue of having a repair-id cutoff, such that repairs older than it are not relevant and don’t need to run, is less pressing,

though we still don’t want images at first boot to download all repairs that ever existed (though this also shouldn’t be a problem for a while), so we will need a way to have conservative starting points for this that come with images (through main snaps (core, gadget…), config or some seed data)

btw, we had at some point the idea of having an expiry on repairs; for fixing stuff we agreed it doesn’t make sense, but for the debugging use case it still might

afternoon walk thought: if we either

  1. go through (download+execute+decide what to download next) repairs in a sequence one at a time
  2. or have an immediate flag on repairs that means “execute me now before downloading further from the sequence”

then we can probably postpone this problem until we understand its nature better, by at least implementing

  • repair skip-to [--brand=BRAND] ID
  • repair rewind [--brand=BRAND] ID

(strawman syntax), which might make sense anyway, basically giving ourselves the tools to control how to navigate the sequences from the repairs themselves (if needed)

@mvo @niemeyer I have updated the topic wiki to reflect what I think is the current thinking for the very first iteration, and to simplify things.

this is the current thinking for the first implementation about:

How to retrieve each repair BRAND-ID/REPAIR-ID, HTTPS vs HTTP

  • Try to retrieve the headers only (as JSON) over HTTPS at:

    filter on whether it’s applicable or not (recording the information and the decision if not)

  • If applicable retrieve and verify the full repair (as application/x.ubuntu.assertion) also over HTTPS

When doing HTTPS, use max(sys-time, time-lower-bound) as the time for verifying certificates (at least in the case where we got an error about the time validity of the cert, i.e. not yet valid).

Where time-lower-bound is obtained by considering the max of:

  • image creation time (timestamp of seed.yaml for example)
  • server-reported time from previous successful HTTPS requests
  • timestamp of valid retrieved repairs
  • possibly time lower bound as kept by snapd itself

If HTTPS still fails (for TLS-related reasons), try again from scratch, retrieving the full repair over HTTP.
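The time-lower-bound part of the flow above can be sketched in shell; the timestamps are made-up example values standing in for the sources listed above, and the real client would do the equivalent in Go with a custom certificate verifier:

```shell
#!/bin/sh
# Hypothetical sketch: compute time-lower-bound as the max of the
# candidate timestamps (epoch seconds), then take max(sys-time,
# time-lower-bound) as the time to verify certificates against.
max_epoch() {
    # stdin: one epoch-seconds value per line; print the largest
    sort -n | tail -n 1
}

image_time=1497052800      # e.g. timestamp of seed.yaml (made-up value)
server_time=1497139200     # from a previous successful HTTPS response
repair_time=1497225600     # timestamp of a valid retrieved repair

time_lower_bound=$(printf '%s\n' "$image_time" "$server_time" "$repair_time" | max_epoch)
sys_time=$(date +%s)
verify_time=$(printf '%s\n' "$sys_time" "$time_lower_bound" | max_epoch)
echo "verify certificates as of: $verify_time"
```

The point of the max() is that a device with a badly wrong (too early) clock still verifies certificates against a time we know the world has already reached.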


opened as well

skeleton of where the actual running could happen:

I also have a WIP branch in good shape about the actual verification of the signature of repairs.

also created a PR with the state initialisation logic, as discussed at the sprint and with @mvo:

also opened since:

check list from London:

  • primary key is brand and number
  • scripts need to call repair done|retry; default status is retry
  • even if assertion was run and marked for retry, look for new rev
  • when process starts, run local ones, then try fetching more #3935
  • “snap repairs” must show all revisions ever run
  • root trusted key inside tool, brand key comes with assertions #3616 #3930
  • only build snap-repair deb when really needed/wanted
  • copy snap-repair to local disk, replace only when necessary after it reports one successful run
    • replace only when digest changes
  • filter on series, arch, model #3787
  • make repair.json built out of seed/assertions w/ model #3571