Monday 8 December 2008
Reftests
Before I get into what this post is really about, let me heap praise on reftests. David Baron came up with the idea of writing automated tests for layout and rendering where each test comprises two pages and the test asserts that the renderings of the two pages are identical. This works much better than comparing test pages to reference images (although you can use an image as a reference if you want), because you can easily write tests that work no matter what fonts are present, what the platform form controls look like, what the platform's antialiasing behaviour is, and so on. There are almost always many ways to achieve a particular rendering effect in a Web page, so it's very easy to write reftests for parsing, layout, and many rendering effects. There are also tricks we've learned to overcome some problems; for example, if there are a few pixels in the page whose rendering is allowed to vary, you can exclude them from being tested just by placing a "censoring" element over them in both the test and reference pages. In dire circumstances we can even use SVG filters to pixel-process test output to avoid spurious failures. Sometimes, when there are test failures that aren't visible to the naked eye (e.g. tiny differences in color channel values), it's tempting to introduce some kind of tolerance threshold to the reftest framework, but so far we've always been able to tweak the tests to avoid those problems, so I strongly resist adding tolerances. They should not be needed, and adding them would open a big can of worms.
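To make the comparison step concrete, here is a minimal sketch of the pixel-equality check a reftest harness performs, assuming the test and reference pages have already been rendered to PNG screenshots (Pillow is used for the image handling, and the file names are placeholders):

    from PIL import Image, ImageChops

    def renderings_identical(test_png, ref_png):
        """Return True if two rendered screenshots are pixel-for-pixel identical."""
        test = Image.open(test_png).convert("RGBA")
        ref = Image.open(ref_png).convert("RGBA")
        if test.size != ref.size:
            return False
        # getbbox() returns None only when every pixel of the difference image
        # is zero, i.e. this is exact equality with no tolerance threshold.
        return ImageChops.difference(test, ref).getbbox() is None

    # After rendering test.html and reference.html to screenshots, assert:
    assert renderings_identical("test.png", "reference.png")

Note that the check is exact equality; as argued above, tolerances are deliberately avoided.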
The reftest approach may seem obvious but it definitely isn't, because other browsers don't seem to use it. I don't know why. Comparing against reference images only makes sense in very limited circumstances. Dumping an internal representation of the page layout and comparing that against a reference makes life difficult if you want to change your internal representation, and skips testing a lot of the rendering path. Reftests don't even depend on which engine you're using; you can run most of our reftests in any browser. One argument that has been made against reftests is that someone might introduce a bug that breaks the test and the reference in the same way, so the tests pass. That is possible, but if feature X regresses and all reftests pass, that just means you should have had a test specifically for feature X (where the test page uses feature X but the reference doesn't).
Anyway, the problem at hand: quite frequently we fix bugs where a particular page triggers pathological performance problems. For example, we might switch to a slightly more complex algorithm with a lower asymptotic complexity bound, or we might be a little more careful about caching or invalidating some data. Unfortunately we don't have any good way to create automated tests for such fixes. Our major performance test frameworks are not suitable, because these pages are not so important that we would refuse to accept any regression; we just want to make sure they don't get "too bad". Also, we don't want hundreds of numbers that must be compared manually, and it's not clear how to choose baseline numbers for comparison automatically.
One crazy idea I've had is performance reftests: you create a test page and a reference, and the test asserts that the test page's execution time is within some constant factor of the reference's execution time. In this case we would definitely need to introduce a tolerance threshold. One problem is that a slow machine getting temporarily stuck (e.g. in pageout) could easily cause a spurious failure, so perhaps we could measure metrics other than wall-clock execution time. For example, we could measure user-mode CPU times and assert that they match within a constant factor. We could instrument the memory allocator and assert that memory footprint is within a constant factor. I'm not really sure whether this would work, but it would be an interesting project for someone to experiment with, perhaps in an academic setting.
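As a rough illustration of what a performance reftest might look like, here is a sketch that compares user-mode CPU time rather than wall-clock time. It assumes a hypothetical render-page command that loads a page and exits once layout and painting are finished, and a generous constant factor:

    import resource
    import subprocess

    def child_user_cpu_seconds(cmd):
        """Run cmd to completion and return the user-mode CPU time it consumed."""
        before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
        subprocess.run(cmd, check=True)
        after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
        return after - before

    def assert_perf_reftest(test_cmd, ref_cmd, factor=3.0):
        """Assert the test page costs no more than `factor` times the reference."""
        test_time = child_user_cpu_seconds(test_cmd)
        ref_time = child_user_cpu_seconds(ref_cmd)
        assert test_time <= factor * max(ref_time, 0.001), (
            "perf reftest failed: %.3fs vs %.3fs (allowed factor %g)"
            % (test_time, ref_time, factor))

    # 'render-page' is a placeholder for a harness that loads a page and quits.
    assert_perf_reftest(["render-page", "test.html"],
                        ["render-page", "reference.html"])

The same pattern would work for memory: instrument the allocator in both runs and assert that the footprints are within a constant factor.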
Comments
In a nutshell, the goal is to be able to easily add arbitrary microbenchmarks to our test framework, track/graph results over time, and automate regression detection. Lots of performance-sensitive browser behaviour is not captured by Tp/Ts/Tsvg/etc., so this would make it easy to create independent new tests that focus on specific areas (e.g. awesomebar results, toolkit service APIs, etc.).
Hmm, you've given me an idea... Taken to an extreme, why not measure the completion time of all existing reftests, xpcshell tests, mochitests, etc.? That would instantly provide a whole boatload of performance metrics for free. (This is where the automated regression detection obviously becomes important!) It's not a panacea and has some issues, but it is a tempting source of low-hanging metric fruit. (Or, in the US, imperial fruit?)
But I can't think of useful baselines there. With hundreds of reftests, a regression in any one test may have too small an effect on the total time to be noticed. And some tests may have more impact on the total simply because they are more complex.
One of the major advantages of this scheme is that it's pass/fail and can probably run on VMs.