Repair capability (emergency fixes)

upcoming
mvo

#24

That’s a good point, but it also seems like a more general issue. Any clients which are not affected by the problem will be downloading binaries that they won’t be making use of. Perhaps we should look into a more general solution that can better filter repairs at query time?


#25

so far we don’t need much intelligence on the server (a hierarchy of dirs under apache would work, for example); if we need real querying, we need to see how/when that can happen, and it’s also more fragile going by the experience with snap metadata.

an in between approach would be some sort of index file(s) on the server, we would need to think what’s the trust model for that


#26

We did discuss being able to query by arbitrary headers. Was that local only, or has that conversation ever reached the server?

I also can imagine something specific for this particular case. We might introduce a simple language that would allow, for example, defining that a given repair should only be used if some safe expression matches on the local system. But that sounds like something for later.


#27

first of all because these could be large (tens of MBs) there is some discussion whether they would be served through the general assertion service at all,

there is some querying functionality in the server, but it is not open in general and is used only internally between services; atm external clients can only get single assertions


#28

@niemeyer after further thinking I agree that we should remove arch from the primary key,

we really want gapless sequences of repairs coming

  1. from us for all devices
  2. optionally one for each brand for all their devices

the repair-id “brand_model-#” idea was to allow for a 3rd set of sequences targeting exactly specific device models of a brand.

Whether for each gapless sequence we want to do

GET repair-N
GET repair-N+1
GET repair-N+2

(getting through some irrelevant ones)

or do queries? (but the current query capability we have even if opened doesn’t fit what we need here I think)

is a different matter;

notice that queries are also tricky because we either get a stream that includes the bodies of many assertions, which is fragile to fetch and retry on,

or we could do a very simple query and get headers only first and then do GETs of the full assertions for the ones we really care about.
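A minimal sketch of the headers-first option, in Python. The `fetch_headers`, `fetch_full` and `applies` callables are hypothetical stand-ins, not an existing store or snapd API; the walk assumes the sequence is gapless, so the first missing id ends it.

```python
from typing import Callable, Dict, List, Optional

Headers = Dict[str, object]


def walk_sequence(
    start: int,
    fetch_headers: Callable[[int], Optional[Headers]],
    fetch_full: Callable[[int], bytes],
    applies: Callable[[Headers], bool],
) -> List[bytes]:
    """Walk repair-N, repair-N+1, ... until the first missing id.

    fetch_headers returns None when the id does not exist (yet), which
    ends the walk because the sequence is assumed to be gapless.
    """
    fetched: List[bytes] = []
    repair_id = start
    while True:
        headers = fetch_headers(repair_id)
        if headers is None:
            break
        # Only pay for the full (possibly large) assertion when relevant.
        if applies(headers):
            fetched.append(fetch_full(repair_id))
        repair_id += 1
    return fetched
```

Each request stays small and individually retryable, at the cost of one extra round trip per repair that turns out to be relevant.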


#29

given that repairs can now tell whether to retry them or not, the issue of having a repair-id such that repairs older than it are not relevant and don’t need to run is less pressing,

though we still don’t want images at first boot to download all repairs that ever existed (though this also shouldn’t be a problem for a while), so we will need a way to have conservative starting points for this that comes with images (through main snaps (core, gadget…), config or some seed data)


#30

btw, we had at some point the idea to have an expiry on repairs, for fixing stuff we agreed it doesn’t make sense, but for the debugging use case assertions it still might


#31

afternoon walk thought, if we

  1. go through (download+execute+decide what to download next) repairs in a sequence one at a time
  2. or have an immediate flag on repairs that means execute me now before going further downloading from the sequence

we can probably postpone this problem until we understand its nature more, by at least implementing

  • repair skip-to [--brand=BRAND] ID
  • repair rewind [--brand=BRAND] ID

(strawman syntax), which might make sense anyway, basically giving ourselves the tools to control how to navigate the sequences from the repairs themselves (if needed)
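A rough sketch of the state such commands could manipulate, in Python. Everything here is an assumption for illustration: the `SequenceCursors` class, sequences starting at 1, and the reading that skip-to only moves a sequence forward while rewind only moves it back.

```python
from typing import Dict


class SequenceCursors:
    """Tracks, per brand, the id of the next repair to fetch."""

    def __init__(self) -> None:
        self._next: Dict[str, int] = {}

    def next_id(self, brand: str) -> int:
        # Assumption: every sequence starts at id 1.
        return self._next.get(brand, 1)

    def advance(self, brand: str) -> None:
        """Normal progress: move on to the following repair."""
        self._next[brand] = self.next_id(brand) + 1

    def skip_to(self, brand: str, repair_id: int) -> None:
        """Jump ahead so that repair_id is fetched next."""
        if repair_id < self.next_id(brand):
            raise ValueError("skip-to cannot move backwards")
        self._next[brand] = repair_id

    def rewind(self, brand: str, repair_id: int) -> None:
        """Go back so that repair_id is fetched (again) next."""
        if repair_id < 1 or repair_id > self.next_id(brand):
            raise ValueError("rewind cannot move forwards")
        self._next[brand] = repair_id
```

Keeping one cursor per brand mirrors the separate gapless sequences, so navigating one sequence does not disturb the others.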


#32

@mvo @niemeyer I have updated the topic wiki to reflect what I think is the current thinking for the very first iteration, and to simplify things.


#33

this is the current thinking for the first implementation about:

How to retrieve each repair BRAND-ID/REPAIR-ID, HTTPS vs HTTP

  • Try to retrieve the headers only (as JSON) over HTTPS at:

    https://api.snapcraft.io/v2/repairs/BRAND-ID/REPAIR-ID

    filter whether it’s applicable or not (recording information and decision if not)

  • If applicable retrieve and verify the full repair (as application/x.ubuntu.assertion) also over HTTPS

When doing HTTPS, verify certificates against the time given by max(sys-time, time-lower-bound), at least when we got an error about the time validity of the cert (not valid yet).

Where time-lower-bound is obtained by considering the max of:

  • image creation time (timestamp of seed.yaml for example)
  • server reported time of previous successful HTTPS requests
  • timestamp of valid retrieved repairs
  • possibly time lower bound as kept by snapd itself
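The computation above can be sketched in Python as follows. The function names are hypothetical, and how each timestamp is obtained (seed.yaml mtime, HTTP `Date` header, repair timestamp, snapd state) is left to the caller; a source that is unavailable is passed as `None`.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def time_lower_bound(candidates: Iterable[Optional[datetime]]) -> Optional[datetime]:
    """Latest of the known timestamps; None when no source is available."""
    known = [t for t in candidates if t is not None]
    return max(known) if known else None


def cert_verification_time(sys_time: datetime, lower_bound: Optional[datetime]) -> datetime:
    """Time to check certificate validity against: max(sys-time, time-lower-bound)."""
    if lower_bound is None:
        return sys_time
    return max(sys_time, lower_bound)


if __name__ == "__main__":
    # Illustrative, made-up values: a device whose clock reset to the epoch.
    sys_time = datetime(1970, 1, 1, tzinfo=timezone.utc)
    image_built = datetime(2017, 9, 1, tzinfo=timezone.utc)
    bound = time_lower_bound([image_built, None, None, None])
    print(cert_verification_time(sys_time, bound))
```

All timestamps should be timezone-aware (or all naive), since Python refuses to compare the two kinds.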

If HTTPS still fails (in case of TLS-related reasons) try again from scratch retrieving the full repair over HTTP.
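The overall flow could look roughly like this in Python. All the callables and the `TLSError` class are hypothetical placeholders for the real transport and assertion code. Re-checking applicability on the verified headers is an assumption of this sketch, added because the HTTP path never sees the JSON headers and the JSON headers themselves are not signed.

```python
from typing import Callable, Dict, Optional

Headers = Dict[str, object]


class TLSError(Exception):
    """Stand-in for any TLS-related failure (hypothetical)."""


def retrieve_repair(
    fetch_headers_https: Callable[[], Headers],
    fetch_full_https: Callable[[], bytes],
    fetch_full_http: Callable[[], bytes],
    parse_headers: Callable[[bytes], Headers],
    applicable: Callable[[Headers], bool],
    verify: Callable[[bytes], bool],
) -> Optional[bytes]:
    """Return the verified repair, or None if it is not applicable."""
    try:
        # Cheap pre-filter: headers only, as JSON, over HTTPS.
        if not applicable(fetch_headers_https()):
            return None
        body = fetch_full_https()
    except TLSError:
        # Start again from scratch: full repair over plain HTTP.
        body = fetch_full_http()
    # The signature check is what is actually trusted, on either path.
    if not verify(body):
        raise ValueError("repair failed signature verification")
    if not applicable(parse_headers(body)):
        return None
    return body
```

Falling back to HTTP does not weaken trust in the repair itself, since the assertion signature is checked regardless of the transport.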


#34

opened


#35

opened as well


#36

skeleton of where the actual running could happen:


#37

I also have a WIP branch, in good shape, for the actual verification of the signature of repairs.


#38

created also a PR with the state initialisation logic as discussed at the sprint and with @mvo :


#39

also opened since:


#40

check list from London:

  • primary key is brand and number
  • scripts need to call repair done|retry, default status is retry
  • even if assertion was run and marked for retry, look for new rev
  • when process starts, run local ones, then try fetching more #3935
  • “snap repairs” must show all revisions ever run
  • root trusted key inside tool, brand key comes with assertions #3616 #3930
  • only build snap-repair deb when really needed/wanted
  • copy snap-repair to local disk, replace only when necessary after it reports one successful run
    • replace only when digest changes
  • filter on series, arch, model #3787
  • make repair.json built out of seed/assertions w/ model #3571
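For the "filter on series, arch, model" item, a minimal Python sketch follows. The header names (`series`, `architectures`, `models`), their list shape, and the rule that a missing or empty list means "no restriction" are assumptions for illustration, not the confirmed assertion format.

```python
from typing import Dict, List


def repair_applies(headers: Dict[str, List[str]], device: Dict[str, str]) -> bool:
    """True when the repair's optional filters all match this device."""
    checks = (
        ("series", "series"),
        ("architectures", "arch"),
        ("models", "model"),
    )
    for header, device_key in checks:
        allowed = headers.get(header) or []
        # An absent or empty filter is assumed to match every device.
        if allowed and device[device_key] not in allowed:
            return False
    return True
```

For example, a repair carrying only `architectures: [armhf]` would be skipped on an amd64 device and considered on every armhf device, whatever its series or model.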

#41

sample run (against staging):

http://pastebin.ubuntu.com/25585711/

using #3934 #3935


#42

Next phase work:

  • repairs from USB stick
  • early run (initramfs?)

#43

as discussed we should send out a test repair to some subset of devices (likely once other wip things in our work queues have cleared up a bit)