On fluctuations in performance testing results · 2011-04-08 10:25 by Wladimir Palant
Yesterday I concluded that (with all bugs fixed) the results of Mozilla’s add-on performance measurements shouldn’t fluctuate by more than 2% of the Firefox startup time. Sorry, I was wrong. Later that day I noticed Read It Later, currently #43 on that list, supposedly causing a 4% slower Firefox startup. Yet this extension was definitely disabled during performance testing due to bug 648229. Does this now mean that each disabled add-on causes a 4% performance impact? Definitely not; disabled add-ons have no measurable effect on performance. So digging up the raw numbers for that add-on was a good idea. Here they come:
| Test run | Reference time in ms (no extensions) | Read It Later 2.1.1 (ms) | Difference |
|---|---|---|---|
| Windows 7 on March 26th | 548.89 | 549.00 | +0.0% |
| Windows 7 on April 2nd | 541.89 | 617.68 | +14.0% |
| Windows XP on March 26th | 399.79 | 399.63 | -0.0% |
| Windows XP on April 2nd | 401.21 | 402.79 | +0.4% |
| Mac OS X on March 26th | 694.79 | 690.58 | -0.6% |
| Mac OS X on April 2nd | 699.58 | 699.58 | +0.0% |
| Fedora Linux on March 26th | 498.37 | 494.05 | -0.9% |
| Fedora Linux on April 2nd | 495.95 | 511.63 | +3.2% |
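As an aside, the last column is simply the relative difference between the two averages. A minimal sketch of the calculation (the exact formula is my assumption; AMO doesn’t document it):

```python
# How the "Difference" column above is presumably computed (my assumption,
# not a documented AMO formula): relative change vs. the reference run.
def relative_impact(extension_ms, reference_ms):
    """Percent change of the extension run relative to the reference run."""
    return (extension_ms - reference_ms) / reference_ms * 100

# Windows 7, April 2nd (values from the table above):
print(f"{relative_impact(617.68, 541.89):+.1f}%")  # +14.0%
```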
Most results indeed show something very close to the reference time, which makes sense. However, on Fedora this extension supposedly caused a 3% slowdown during the second run, and on Windows 7 it even got to 14%. Here are the results for the individual measurements that this average consists of:
The first measurement is always significantly higher and is ignored for the average, as I already mentioned. If you look at the other measurements, however: the times were pretty close to the reference value (as expected), and then something changed and the numbers got about 200 ms higher. Had this happened at the beginning of the test run, it would have increased the extension’s score on Windows 7 by 35%! Even if the measurements on all other platforms were correct (Fedora’s wasn’t), that would translate into 9% more in the overall score. In fact, I suspect that this is exactly what happened to the add-on that was tested right after it.
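To illustrate the mechanism, here is a minimal sketch with made-up numbers; the only thing I know about the scoring is that the first measurement gets dropped, the rest is my assumption:

```python
# Hypothetical measurements mimicking the Windows 7 run on April 2nd: a high
# first (warm-up) value, then times near the 541.89 ms reference, then a
# ~200 ms jump partway through. All numbers are made up for illustration.
REFERENCE = 541.89

def startup_score(measurements):
    """Average startup time with the first (warm-up) measurement dropped."""
    counted = measurements[1:]
    return sum(counted) / len(counted)

late_jump = [900, 545, 543, 547, 544, 546, 548, 745, 748, 751]
score = startup_score(late_jump)
print(f"{score:.0f} ms, {100 * (score - REFERENCE) / REFERENCE:+.0f}%")  # 613 ms, +13%

# Had the jump happened at the start of the run, every counted measurement
# would be ~200 ms too high and the score would balloon accordingly:
early_jump = [900, 745, 743, 747, 744, 746, 748, 745, 748, 751]
score = startup_score(early_jump)
print(f"{score:.0f} ms, {100 * (score - REFERENCE) / REFERENCE:+.0f}%")  # 746 ms, +38%
```

With the actual values from the logs the jump works out to the 35% mentioned above; the hypothetical numbers here merely show the mechanism.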
Nils Maier thinks that such fluctuations have something to do with other jobs running on the same machine, particularly ones doing heavy I/O. I am no expert on this, so I can neither agree nor disagree. Dear AMO, please clear this up before you start stigmatizing add-ons as “slow”.
Update: I looked into the results of the other add-ons tested on the same machine (talos-r3-w7-020). Both the add-on tested before Read It Later (Flagfox) and the one tested after it (SimilarWeb) show the same irregularity: the individual measurements were first pretty close to the test results from the previous week and became significantly higher towards the end of the test. This earned Flagfox 14% more on Windows 7 (4% more in the overall score), and SimilarWeb got 8% more (2% more in the overall score). In the case of SimilarWeb, these 2% were enough to push it into the list of Top 10 worst offenders.
Update 2: Nils wrote a script to check the standard deviation of the performance measurements; you can see his script and the results here: https://gist.github.com/909583. I could reproduce his results with my own script, checking the logs for all platforms. Out of 100 add-ons, twelve weren’t tested at all, most likely because the download failed (different download packages for different platforms). Four add-ons were only tested partially (NoScript and Adblock Plus only on Windows 7/XP, BetterPrivacy only on OS X and Fedora, Web of Trust only on Windows 7/XP and Fedora). In all four cases the test timed out because the browser couldn’t be closed, most likely because of first-run pages. In addition, Nils found five add-ons with a negative impact; these were definitely tested in a disabled state, and there are likely many more like them.
As for the remaining add-ons, the measurements for 16 of them show a high standard deviation (more than 10% of the average startup time). Given that these irregularities only appear on one or two platforms, it should be safe to exclude the extension itself as the source of the deviations. One extreme case is Tree Style Tab, which was tested in a disabled state, yet one of its measurements on Windows 7 was a whopping 500% above the reference time. A similar scenario happened with Forecastfox on OS X: it had multiple outliers, one of them 110% above the reference time. Neither add-on made the list because they got a negative score on at least one other platform, so their results were ignored (this happened to most extensions that were tested in a disabled state). The add-ons with high deviations that did make the list are (starting with the highest standard deviation): Read It Later, StumbleUpon, RSS Ticker, FastestFox, Flagfox, Download YouTube Videos (both Windows 7 and Fedora results are suspicious), Personas Plus, SimilarWeb (also both Windows 7 and Fedora suspicious), CoolPreviews.
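For the curious, the check itself is simple. Here is a minimal sketch of the idea; this is my reconstruction rather than Nils’s actual code (his version is in the gist linked above):

```python
# Flag an add-on when the standard deviation of its startup measurements
# exceeds 10% of their average, the threshold used above. My reconstruction
# of the idea, not Nils's actual script (see the gist linked above).
import statistics

def is_suspicious(measurements, threshold=0.10):
    counted = measurements[1:]  # drop the first (warm-up) measurement
    mean = statistics.mean(counted)
    return statistics.stdev(counted) > threshold * mean

# The hypothetical Windows 7 run from above, with its late ~200 ms jump:
print(is_suspicious([900, 545, 543, 547, 544, 546, 548, 745, 748, 751]))  # True
# A steady run close to the reference would pass:
print(is_suspicious([900, 545, 543, 547, 544, 546, 548, 545, 548, 551]))  # False
```

Run over the logs for all platforms, a check like this flags the same add-ons listed above.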