Thanks for getting this feature started, Michael.
Upon reading the details above, I got the feeling that "emergency" might not be the best term for the feature, because it implies a sense of urgency that is in fact not there. While we do intend for the fixes to be applied in a timely manner, this is not about a major flaw that is compromising the system and needs to keep people awake. It's rather something important that needs to be applied at the first chance to put the system back in proper order. As a proposal, it sounds like the term repair would be more in line with the assertion and the material around the feature. Besides the obvious meaning, it's also nice because it contains the idea of "re-pairing", which is the intention here: re-establishing the proper flow of communication after something unexpected happened.
In terms of the ID, I suggest going for a more minimalist approach, and we can create a convention of calling those REPAIR-42, or similar. This also creates a more natural ordering, which may be useful depending on how we decide the succession of repairs should take place.
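To make the ordering point concrete, here is a minimal sketch, assuming IDs are a fixed REPAIR- prefix followed by a positive integer. That exact format, and the helper names `repair_number` and `in_succession`, are illustrative assumptions rather than anything settled. The point is only that the numeric part, not the string, has to drive the order; otherwise REPAIR-10 would sort before REPAIR-2.

```python
import re

# Assumed ID format (hypothetical): "REPAIR-" followed by a positive integer.
_REPAIR_ID = re.compile(r"REPAIR-([1-9][0-9]*)")


def repair_number(repair_id: str) -> int:
    """Extract the numeric part of an ID such as REPAIR-42."""
    match = _REPAIR_ID.fullmatch(repair_id)
    if match is None:
        raise ValueError(f"not a valid repair ID: {repair_id!r}")
    return int(match.group(1))


def in_succession(repair_ids):
    """Order repairs numerically rather than lexically."""
    return sorted(repair_ids, key=repair_number)
```

For example, `in_succession(["REPAIR-10", "REPAIR-2", "REPAIR-42"])` yields the IDs in the order 2, 10, 42.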
About the proposed until field, my gut feeling is that it is suspect. I can't think of a reason why a repair would have a defined time to run out. Either we want the fix to be made, or we don't. If the idea is being able to disable it, perhaps a more explicit bool field (disabled?) would be better.
The four points you made are spot on, so here are some further ideas about each of them:
secure – This is an easy one for us to sort out at this stage because assertions already give us much of what we need.
resilient – This is a trickier one. The mechanism needs to be cooked in such a way that it will tend to work even when everything else is falling apart, and anticipating what will go wrong in the future without information from the past is non-trivial. A few initial ideas:
- Tool that refreshes those assertions should be statically built and run out of band regularly via a timer
- Same tool should also be run once during boot so it can repair systemd itself
- Ideally the tool would also work through the automatic mechanism that we have in place for importing assertions from USB sticks, so that a problem that compromises the network can be recovered from.
- Design for it should be minimalist, preferably with much simpler and to-the-point versions of the stock logic for assertions and whatever else is needed, so breakages in those sensitive areas may be fixed
- For those reasons, the tool needs to use its own minimalist storage space to control its own operation.
- We need to have very good test coverage, to protect against problems in the compiler and building process in general, since this tool and underlying mechanism will be rarely used.
effective – On this front, we need to ensure that the mechanism enables us to perform the necessary fix. Some ideas around that:
- We definitely need to be able to at least run a script from the assertion, but even that may turn out to be too restrictive if we consider problems such as broken network drivers. We may need to ship a whole new driver to restore the machine's well-being, and that obviously can't be done via the network itself. So perhaps the body of the assertion should be a base64-encoded compressed tarball which has an entry command with a well-known name (repair?).
- We might support both: either a script starting with #! and an arbitrary interpreter, or a tarball with a repair command. This gives us better visibility when possible, which I expect to be the more common case (on a rarely seen event), and it still keeps our get-out-of-jail card for those even rarer circumstances.
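As a rough illustration of supporting both forms, the sketch below tells the two kinds of body apart. It assumes the body arrives as raw bytes, that a tarball body is standard base64 (possibly line-wrapped), and that the entry command is a top-level member named repair; `classify_body` and `ENTRY_NAME` are hypothetical names. One nice property is that the check is unambiguous: neither `#` nor `!` is part of the base64 alphabet, so an encoded tarball can never look like a script.

```python
import base64
import io
import tarfile

ENTRY_NAME = "repair"  # assumed well-known name of the entry command


def classify_body(body: bytes):
    """Return ("script", script_bytes) or ("tarball", member_names)."""
    if body.startswith(b"#!"):
        # Plain script: readable as-is, run with whatever interpreter it names.
        return ("script", body)

    # Otherwise treat the body as a base64-encoded compressed tarball.
    # Whitespace is dropped first so line-wrapped base64 is accepted.
    raw = base64.b64decode(b"".join(body.split()), validate=True)
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tar:
        names = tar.getnames()
    if ENTRY_NAME not in names:
        raise ValueError(f"tarball has no {ENTRY_NAME!r} entry command")
    return ("tarball", names)
```

This only inspects the body; actually unpacking and executing it would belong to the repair tool itself, after the assertion signature has been verified.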
transparent – This is an easy one to get right as well, I think. We definitely want to expose and keep track of whenever the mechanism is used, so people know exactly which procedures have run on their devices, when, and what happened when that took place. That includes:
- Proper logging into the usual places while the repair is taking place.
- A command to list all repairs ever applied (or perhaps applied in the last year?), including the ability to obtain stdout/stderr generated while applying it. We need to watch out for concurrency issues here, since the tool will have its own storage space, and for obvious reasons we don't want to lock the repair tool out of its own space due to a bug elsewhere.
- We need to be careful with revisioning around those assertions. I think we want to allow updating the assertion to improve its logic while we're addressing the problem, but at the same time we don't want to have the new revision applied on devices that have already run an old revision, nor do we want to hide the actual revision applied since that contains the logic that has run on that particular device.
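The revision rule in the last point can be sketched as follows. This is an in-memory illustration only: a real tool would persist the record in its own storage space, and the class and method names here (`RepairLog`, `run_once`, `applied_revision`) are made up for the example. It also assumes a repair counts as applied only once its action returns without raising, which is a policy choice still to be made.

```python
class RepairLog:
    """Tracks which revision of each repair actually ran on this device."""

    def __init__(self):
        self._applied = {}  # repair ID -> revision that ran

    def run_once(self, repair_id, revision, action):
        """Run action unless some revision of this repair already ran.

        Returns True if the action ran, False if it was skipped.
        """
        if repair_id in self._applied:
            # An older revision already ran here: skip the newer one and
            # keep the record of what was really executed.
            return False
        action()
        self._applied[repair_id] = revision
        return True

    def applied_revision(self, repair_id):
        """Return the revision that ran, or None if the repair never ran."""
        return self._applied.get(repair_id)
```

With this, a device that ran revision 0 of REPAIR-1 ignores a later revision 1 and still reports revision 0 as the one applied, while a device that has not run the repair yet picks up whichever revision it sees first.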
How does that sound?
I'll keep updating those notes as I think of more details.