Travis and spread are back up

niemeyer · September 16, 2017, 12:48am

Hello @snapd team,

I’ve just confirmed Travis is back up and running.

The issues we’ve observed today were, as often happens in this sort of case, a consequence of multiple problems lining up:

Some specific event that is still unknown made a large number of systems stall for a little while early today. On its own, this wouldn’t be an issue, as Spread is supposed to be self-healing.
All of the stalls happened at about the same time, so almost all of our machines ended up running but no longer being remotely controlled, within a few minutes of one another. Their job queues reflected that.
Because of a poorly designed API in Linode, a poorly designed hack in Spread was using the delta between the job queue times to find out if it was time to halt a machine. With all machines stuck at almost exactly the same time, this logic failed to halt machines.
As a secondary problem, the logic in Spread that obtained the current time based on job queues was too slow for the number of machines we’re using now. This was not a fatal problem in itself, but it made the main problem harder to observe.
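
For anyone curious about why the halting logic failed, here is a rough Python sketch of the failure mode. It is not Spread's actual code (Spread is not written in Python). The names `inferred_now`, `machines_to_halt`, and `HALT_AFTER`, the one-hour threshold, the timestamps, and the choice of the newest job time as "now" are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: halt a machine idle for longer than this.
HALT_AFTER = timedelta(hours=1)


def inferred_now(last_job_times):
    """Approximate the current time as the newest job time on any machine.

    This stands in for the hack described above: with no direct clock
    available, "now" is derived from the job queues themselves.
    """
    return max(last_job_times.values())


def machines_to_halt(last_job_times, now):
    """Return the names of machines whose last job is older than HALT_AFTER."""
    return sorted(
        name for name, seen in last_job_times.items() if now - seen > HALT_AFTER
    )


# Made-up timestamps: three machines stall within a few minutes of each other.
stall = datetime(2017, 1, 1, 9, 0)
fleet = {
    "m1": stall,
    "m2": stall + timedelta(minutes=2),
    "m3": stall + timedelta(minutes=4),
}

# With "now" inferred from the queues, every delta is only a few minutes,
# so nothing looks old enough to halt.
stuck = machines_to_halt(fleet, inferred_now(fleet))

# For contrast only: an independent clock six hours later flags all of them.
halted = machines_to_halt(fleet, stall + timedelta(hours=6))

print(stuck)
print(halted)
```

The point of the sketch is that a clock derived from the fleet stops advancing when the whole fleet stops, so the deltas never grow past the threshold.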

All of these issues were fixed. The Linode machines were not touched other than by Spread itself, which means it was able to recover from the problem without manual administrative actions, as it’s supposed to. I’ve just restarted the tests after uploading a new version of Spread.

Apologies for the trouble. Please let me know if you see further problems.