In the last couple of days I’ve worked on Spread to fix some issues in our ever-growing integration tests, both to reduce the chances of errors and to improve the test timings. We have just passed the mark of 1600 jobs covering 7 different Linux distribution specifications, and a full run was frequently approaching 40 minutes.
There were three main improvements:
Until now all our test runs were sharing a common pool of 80 machines in Linode. In the busiest times of the day we could observe issues of contention with different runs waiting for machines to become available. Instead of continuing the trend of allocating more machines into the pool, Spread got support for dynamically allocating new systems on demand. This is slightly more expensive per run, but we’ll make up for it by not paying for the whole pool during quiet hours of the day and peaceful weekends.
Support for this is enabled via two new fields under the Linode backend:
These changes are already in the master branch, and our pre-allocated machine pool is gone altogether, so if you need to use Linode for something, please make sure to merge from master first.
Note that the implementation was made in such a way that if a run is killed for whatever reason, the machines will remain live, and the usual pool behavior kicks in. That is, the next run will reuse those machines after the halt timeout has passed, and will then remove them altogether at the end, as the former runner should have done.
More workers, larger
Given the above change, I’ve also pumped up the number of workers per node, and allocated workers that are 4 times as large as the ones we were using (2GB plan→8GB plan). I don’t have precise numbers yet, but I believe the savings will be enough for us to break even with the prior monthly cost of the system, or even improve on it a bit. I’ll keep an eye on this over the next weeks.
As I was investigating the timings of our test runs, I realized that in some cases the longer tests were being started closer to the end of the run. The effect of this is that sharing across workers is hindered, because by then there’s nothing else for the other workers to do and they just terminate. If instead the larger tasks start earlier, the work stealing model of Spread means the other workers will get busy with the smaller tasks, and by the time the long tasks are finished their respective workers can also churn through the remaining few tasks, with all workers terminating closer to each other.
With that in mind, support for a new priority field was added to Spread, and it may be specified at any level of the hierarchy (task, suite, system, backend). The larger the priority, the earlier the task is scheduled. The default priority is zero, and negative priorities are supported.
This was then used in our offending tasks that take 2 minutes or more.
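To illustrate why starting the long tasks first helps, here is a small simulation. It is only a sketch of the general idea, not Spread's actual scheduler: it assumes workers simply pull the next task from one shared queue, and the task names, durations, and worker count are made up for the example.

```python
import heapq

def makespan(durations, workers):
    """Total run time when idle workers pull tasks from a shared queue in order."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    for minutes in durations:
        start = heapq.heappop(free_at)       # earliest idle worker takes the task
        heapq.heappush(free_at, start + minutes)
    return max(free_at)

def by_priority(tasks):
    """Order tasks so that larger priorities run earlier (stable for ties)."""
    return sorted(tasks, key=lambda task: -task[2])

# Hypothetical tasks: (name, minutes, priority). Priority defaults to zero.
tasks = [("short-%d" % i, 1, 0) for i in range(8)]
tasks.append(("long", 6, 0))                 # the long task happens to come last

unprioritized = makespan([t[1] for t in tasks], workers=3)

# Give the long task a higher priority so it is scheduled first.
tasks[-1] = ("long", 6, 100)
prioritized = makespan([t[1] for t in by_priority(tasks)], workers=3)

print(unprioritized, prioritized)
```

In this toy case the run takes 8 minutes when the long task starts last, because two workers sit idle while it finishes, and 6 minutes when it starts first, since the other workers absorb the short tasks in the meantime.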
Travis is now taking around 23 minutes to run a full round of tests, and hasn’t failed on anything related to machine allocation, except for initial quota issues that were sorted out with Linode.
(side note: @cachio You can see in this build that the snap-service failed. It failed several times during my tests, and seems real. Can you please have a look at this when you have a moment?)
All the Spread changes are documented in the usual place:
Let me know!