Tuesday, April 19, 2011

Load from March 24th to April 19th

Last week I did a post about how high our load was for that day and to let people know that we are looking into mitigating the bad wait times we have been seeing.

We know that we need more slaves, but we also know that our masters are hitting edge cases and not being optimal. We now believe that bug 592244 is behind some chunk of the wasted CPU, since it causes some jobs to run twice. The problem is that we have several masters that query a scheduling master, and sometimes the same job ends up running on two different masters. catlee has done a great job chasing this down, and we hope that fixing this issue will improve the wait times significantly (it would have been hard for us to narrow down this issue without his help). If it does not help us enough to get by, we will have to go back and chase other edge cases in our masters. Meanwhile, IT and releng are still working on getting the next pool of test slaves.
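To give a feel for the kind of race involved, here is a simplified sketch (this is not actual buildbot code; the table and column names are made up for illustration). If a master reads the list of pending requests first and only marks them as claimed afterwards, a second master can read the same rows in between and both end up running the same job:

    # Simplified sketch of the duplicate-job race (not actual buildbot code;
    # the "requests" table and "claimed_by" column are made up for illustration).
    import sqlite3

    def claim_pending_jobs_racy(conn, master_name):
        cur = conn.cursor()
        # Step 1: read the pending requests.
        cur.execute("SELECT id FROM requests WHERE claimed_by IS NULL")
        pending = [row[0] for row in cur.fetchall()]
        # Step 2: mark them as ours. Another master that ran step 1 in the
        # meantime has already decided to run the very same requests.
        for req_id in pending:
            cur.execute("UPDATE requests SET claimed_by = ? WHERE id = ?",
                        (master_name, req_id))
        conn.commit()
        return pending

    def claim_pending_jobs_atomic(conn, master_name):
        cur = conn.cursor()
        # Claim in a single atomic statement: a request whose claimed_by
        # column is already set can never be claimed by a second master.
        cur.execute("UPDATE requests SET claimed_by = ? WHERE claimed_by IS NULL",
                    (master_name,))
        conn.commit()
        cur.execute("SELECT id FROM requests WHERE claimed_by = ?",
                    (master_name,))
        return [row[0] for row in cur.fetchall()]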

And now back to the load (link to page with raw data):
  • on the 11th we handled 138 pushes across all branches (the day before the aurora merge)
  • try server accounted for 47.5% of the whole load, mozilla-central for 16.9% and cedar for 11.2% (/me looks at ehsan)
Conclusions:
  • even though we had the trip to Las Vegas, the all-hands and platform's work week, we have had a very high load since we shipped Firefox 4
I wonder what the distribution from April 18th to the end of the month will look like, as it should be more representative of normal development.

For the next post I should only grab weekdays and overlay them to see how things look from week to week.



Tuesday, April 12, 2011

Yesterday's load

I will do a longer analysis at some point, but for now I would like to share a link and a screenshot of it.
These two diagrams show commits over 24 hours (from Mon, 11 Apr 2011 00:00 PDT to Tue, 12 Apr 2011 00:00 PDT) across all of our currently supported project branches. The first diagram shows pushes per hour and the second shows how those pushes are distributed among the different project branches.

Each one of these commits produces different types of builds and tests. For a given build we can end up queuing up to 14 test suites plus 8 different talos jobs for a given OS.
How easily can the test pool run out of capacity? Three builds for the same OS finishing around the same time can generate up to 66 testing jobs and tie up more than the whole testing pool for that OS (we have 48 to 54 machines per OS) for a variable amount of time. Test jobs can take from 5 minutes to more than 60 minutes depending on the OS and the test suites.
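To make the arithmetic concrete, here is a back-of-the-envelope calculation with the numbers above (the 22 jobs per build come from the 14 test suites plus 8 talos jobs mentioned earlier; 48 is the low end of the 48 to 54 machines per OS):

    # Back-of-the-envelope capacity check using the numbers from this post.
    TEST_SUITES_PER_BUILD = 14   # test suites that can be queued per build
    TALOS_JOBS_PER_BUILD = 8     # talos jobs queued per build
    JOBS_PER_BUILD = TEST_SUITES_PER_BUILD + TALOS_JOBS_PER_BUILD   # 22

    BUILDS_FINISHING_TOGETHER = 3
    POOL_SIZE = 48               # low end of the 48-54 machines per OS

    queued = BUILDS_FINISHING_TOGETHER * JOBS_PER_BUILD   # 66 jobs
    waiting = queued - POOL_SIZE                          # 18 jobs have to wait

    print("jobs queued: %d, pool size: %d, jobs left waiting: %d"
          % (queued, POOL_SIZE, waiting))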

For further information on test times, I have some raw data from back in December (out-of-date warning) and three blog posts where I drew conclusions from it.

This high load of pushes and the clustering of pushes (how close they are to each other) cause test jobs to be queued and to wait before being processed (this can be seen in the daily Wait Time emails on dev.tree-management). We need more machines (and we are working on it), but here are a few things that you can do to improve things until then:
  • Use the TryChooser syntax. Spending a moment to choose a subset of build and test jobs for your change helps us use only the resources your push actually needs (see the example after this list). If you need all builds and tests, do not hesitate to request them all. Note that at some point this syntax will become mandatory.
  • Cancel unneeded jobs. Use self-serve (which shows up on tbpl) to stop running or pending jobs once you know that they are not needed, for example because you pushed something incorrect or the build is going to fail. Once a build or test is not needed, please cancel it to free up resources. Everyone will thank you.
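As a quick example of the TryChooser syntax (double-check the TryChooser documentation for the exact flags; the platform and suite names below are only illustrative), a change that only needs a Linux opt build with mochitests could use a commit message such as:

    try: -b o -p linux -u mochitests -t none

whereas "try: -b do -p all -u all -t all" requests every build, test and talos job.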
There are also things that could be fixed, like improving the reftest and xpcshell times on Win7, but that is not something that everyone can help with in a reasonable amount of time.

[EDIT] 4:15pm PDT - I want to highlight that there is going to be a series of blog posts explaining the work and the new testing machine purchases that we will be undertaking to handle such bad wait times.


Creative Commons License
This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.