Hello all,
For the record, from today on Spread is supposed to recover by itself even from the unfortunate space error that looks like this:
2017-08-24 06:56:54 Cannot allocate linode:ubuntu-14.04-64: cannot create Linode disk with ubuntu-14.04-64: you do not have enough unallocated storage to create this Disk (1452 requested, but only 0 available
Some quick background for future generations: Spread has no central manager service, which means it needs nobody babysitting it and we can all have good holidays. For that to work we need a way to control allocation conflicts in backends that share resources, such as our Linode one, which is what we use for Travis.
Unfortunately, it turns out that it’s surprisingly difficult to do that in Linode today, because the API works in terms of a queue of jobs, which means we don’t really have a good atomic control point to detect whether a given machine was allocated or not. The trick we use is to allocate storage and then check whether somebody else allocated storage before we did. If that’s the case, we cooperatively give up on the allocation and look for another machine.
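To make the trick concrete, here is a minimal sketch of the allocate-then-check idea. It is not Spread's actual code and it does not call the Linode API: the `Machine` class, the `try_claim` and `allocate` functions, and the in-memory disk list are all hypothetical stand-ins, and the sketch assumes disk ids increase in creation order.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical stand-in for the backend: ids grow in creation order.
_disk_ids = itertools.count(1)


@dataclass
class Machine:
    name: str
    # (disk_id, owner) pairs currently present on the machine.
    disks: list = field(default_factory=list)


def try_claim(machine, owner):
    """Create our disk first, then check whether anyone got there before us."""
    disk = (next(_disk_ids), owner)
    machine.disks.append(disk)

    # The earliest disk on the machine wins the claim.
    if min(machine.disks) != disk:
        # Somebody else allocated before we did: give up cooperatively.
        machine.disks.remove(disk)
        return False
    return True


def allocate(machines, owner):
    """Return the first machine we manage to claim, or None if all are taken."""
    for machine in machines:
        if try_claim(machine, owner):
            return machine
    return None
```

With two machines, a first caller gets the first machine, a second caller loses the check there and moves on to the second machine, and a third caller gets nothing. The losing attempts remove their own disk, so nothing is left behind in the normal case.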
This works fine in most cases, and we were already cleaning up old storage left behind at the end of runs if it was allocated too far in the past (the halt-timeout). But it was still failing in the case where a problem elsewhere causes so much allocated storage to be left behind that even that first allocation doesn’t work anymore.
This is now fixed: Spread will kill old storage that is beyond the halt-timeout if it is unable to perform that first allocation. Note that we’ll still observe the allocation error, but Spread should naturally clean up after itself and the machine will become reusable again shortly after the error.
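For the curious, the recovery behavior can be sketched roughly as follows. This is only an illustration under assumptions, not Spread's implementation: `Machine`, `NoSpaceError`, `remove_stale_disks` and `allocate_disk` are hypothetical names, storage is simulated in memory, and times are plain numbers in seconds.

```python
from dataclasses import dataclass, field


class NoSpaceError(Exception):
    """Hypothetical error raised when a disk does not fit on the machine."""


@dataclass
class Machine:
    capacity: int
    # label -> (size, created_at)
    disks: dict = field(default_factory=dict)

    def free(self):
        return self.capacity - sum(size for size, _ in self.disks.values())

    def create_disk(self, label, size, now):
        if size > self.free():
            raise NoSpaceError(
                f"{size} requested, but only {self.free()} available")
        self.disks[label] = (size, now)


def remove_stale_disks(machine, now, halt_timeout):
    """Delete disks older than halt_timeout and return their labels."""
    stale = [label for label, (_, created_at) in machine.disks.items()
             if now - created_at > halt_timeout]
    for label in stale:
        del machine.disks[label]
    return stale


def allocate_disk(machine, label, size, now, halt_timeout):
    """Try the first allocation; on failure, clean up so a retry can work."""
    try:
        machine.create_disk(label, size, now)
        return True
    except NoSpaceError:
        # The error is still reported to the caller as a failed allocation,
        # but stale storage is removed so the machine becomes reusable.
        remove_stale_disks(machine, now, halt_timeout)
        return False
```

In this sketch the failing call still returns `False`, mirroring the fact that the allocation error remains visible, while a later attempt on the same machine succeeds once the stale disk is gone. Disks younger than the halt-timeout are left alone, since they may belong to a run that is still in progress.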