Dynamic allocations and priorities in Spread

niemeyer · January 24, 2018, 1:34am

Hello all,

In the last couple of days I’ve worked on Spread to fix some issues in our ever-growing integration tests, both to reduce the chances of errors and to improve the test timings. We are just now passing the mark of 1600 jobs covering 7 different Linux distribution specifications, and a full run was frequently approaching 40 minutes.

There were three main improvements:

Dynamic allocations

Until now all our test runs were sharing a common pool of 80 machines in Linode. In the busiest times of the day we could observe issues of contention with different runs waiting for machines to become available. Instead of continuing the trend of allocating more machines into the pool, Spread got support for dynamically allocating new systems on demand. This is slightly more expensive per run, but we’ll make up for it by not paying for the whole pool during quiet hours of the day and peaceful weekends.

Support for this is enabled via two new fields under the Linode backend:

backends:
    linode:
        (...)
        plan: 8GB
        location: fremont

These changes are already in the master branch, and our pre-allocated machine pool is gone altogether, so if you need to use Linode for something, please make sure to merge from master first.

Note that the implementation was made in such a way that if a run is killed for whatever reason, the machines will remain live, and the usual pool behavior kicks in. That is, the next run will reuse those machines after the halt timeout has passed, and will then remove them altogether at the end, as the former runner should have done.

More workers, larger

Given the above change, I’ve also pumped up the number of workers per node, and allocated workers that are 4 times as large as the ones we were using (2GB plan→8GB plan). I don’t have precise numbers yet, but I believe the savings will be enough for us to break even with the prior monthly cost of the system, or even improve on it a bit. I’ll keep an eye on this over the next weeks.

Task priorities

As I was investigating the timings of our test runs, I realized that in some cases the longer tests were being started close to the end of the run. The effect of this is that sharing across workers is hindered, because by then there’s nothing else for the other workers to do and they just terminate. If instead the larger tasks start earlier, the work stealing model of Spread means the other workers will get busy with the smaller tasks, and by the time the long tasks are finished their respective workers can also churn through the remaining few tasks, with all workers terminating closer to each other.
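To make the effect concrete, here is a minimal sketch of the reasoning above. It is not Spread’s scheduler: it assumes a simplified model where an idle worker always takes the next task from a shared ordered list, and the workload (eight 1-minute tasks, one 6-minute task, 2 workers) is made up for illustration.

```python
import heapq


def makespan(durations, workers):
    """Total run time when idle workers pull tasks from a shared list, in order."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # the earliest-free worker takes the next task
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical workload, durations in minutes.
short_tasks = [1] * 8
long_task = [6]

long_last = makespan(short_tasks + long_task, workers=2)
long_first = makespan(long_task + short_tasks, workers=2)
print(long_last, long_first)  # prints "10 7"
```

With the long task last, one worker sits idle for the final 6 minutes. With it first, the other worker handles the short tasks in the meantime and both finish together.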

With that in mind, support for a new priority field was added to Spread, and it may be specified at any level of the hierarchy (task, suite, system, backend). The larger the priority, the earlier the task is scheduled. The default priority is zero, and negative priorities are supported.

We then applied it to our offending tasks, the ones that take 2 minutes or more.
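As a rough illustration of the ordering rule, the sketch below sorts a few tasks by priority. The task names and values are hypothetical, and how Spread breaks ties or combines priorities set at different levels is not covered here; this only shows "larger first, default zero, negatives allowed".

```python
# Hypothetical tasks; a missing priority is treated as zero.
tasks = [
    {"name": "quick-check"},
    {"name": "long-upgrade", "priority": 100},
    {"name": "cleanup", "priority": -10},
    {"name": "medium-install", "priority": 50},
]


def schedule_order(tasks):
    # Larger priority first; sorted() is stable, so ties keep their listed order
    # (an assumption of this sketch, not a statement about Spread).
    return sorted(tasks, key=lambda task: -task.get("priority", 0))


order = [task["name"] for task in schedule_order(tasks)]
print(order)  # ['long-upgrade', 'medium-install', 'quick-check', 'cleanup']
```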

Results

Travis is now taking around 23 minutes to run a full round of tests, and hasn’t failed on anything related to machine allocation, except for initial quota issues that were sorted out with Linode.

(side note: @cachio You can see in this build that the snap-service failed. It failed several times during my tests, and seems real. Can you please have a look at this when you have a moment?)

Documentation

All the Spread changes are documented in the usual place:

Issues

Let me know!

cachio · January 24, 2018, 2:25am

Nice change

Yes, I also saw this error in the PRs for the new gpg tests, but so far I couldn’t reproduce it locally. I’ll keep trying to reproduce the issue so I can provide more info.

Thanks

cachio · January 24, 2018, 3:29am

I could finally reproduce it (just on trusty). The error seems to be genuine: I manually did the reload in the debug session and the status did not show the expected text. In another instance I did the reload and it did show the “reloading reloading reloading” text.

This is the full log. I’ll continue working on this one tomorrow.

https://paste.ubuntu.com/26448447/