Over the last couple of weeks I've been working on automated visual regression testing for Gecko. The idea is to run Gecko through a battery of HTML test files and render each one to an image file. We can then compare the images produced by different Gecko versions; if a new version renders a testcase differently then we've probably either fixed a bug or introduced a new bug (or both!). When layout engineers develop patches, we can see which testcases change and hopefully catch any unexpected regressions before checkin. (It's amazing we didn't have this years ago; we've survived until now because we have lots of great volunteers who download nightly builds, test them, and report bugs.)
We already have regression tests based on analyzing the frame tree geometry computed by the layout engine. Visual regression tests have some advantages over comparing those coordinates:
- Visual tests avoid a class of false positives, because changes in the frame tree that don't affect rendering are not flagged.
- Visual tests avoid a class of false negatives, because they detect changes in rendering caused by z-ordering, image decoding, graphics layer changes, clipping, and so on, not just layout.
- Visual test results are easier to understand because we can show visually the difference between two runs. Anyone with HTML+CSS knowledge can run visual tests and analyze the output.
- These visual regression tests can run using an optimized build, but our frame tree tests require a debug build, so the visual tests will run 2-3 times faster. Also you'll be able to use standard Mozilla nightly builds to get a baseline, if you want to.
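As a rough illustration of the comparison step, here is a minimal Python sketch that counts the pixels that differ between two rendered images. It is not the actual test harness. It assumes the images are binary PPM files (the `P6` variant, 8 bits per channel); the post only says "PPM format" later on, so the exact variant is an assumption, and the function names `read_ppm` and `count_differing_pixels` are made up for this example.

```python
def read_ppm(data):
    """Parse a binary PPM (assumed P6, maxval <= 255) held in a bytes object.

    Returns (width, height, pixels), where pixels is the raw RGB byte string.
    """
    fields = []
    pos = 0
    n = len(data)
    # The header is four whitespace-separated fields: magic, width, height, maxval.
    while len(fields) < 4:
        while pos < n and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= n:
            raise ValueError("truncated PPM header")
        if data[pos:pos + 1] == b"#":
            # Header comments run to the end of the line.
            while pos < n and data[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < n and not data[pos:pos + 1].isspace():
            pos += 1
        fields.append(data[start:pos])
    if fields[0] != b"P6":
        raise ValueError("only binary (P6) PPM is handled in this sketch")
    width, height, maxval = (int(f) for f in fields[1:])
    if maxval > 255:
        raise ValueError("only 8-bit channels are handled in this sketch")
    # Exactly one whitespace byte separates the header from the pixel data.
    expected = width * height * 3
    pixels = data[pos + 1:pos + 1 + expected]
    if len(pixels) != expected:
        raise ValueError("truncated pixel data")
    return width, height, pixels


def count_differing_pixels(image_a, image_b):
    """Return how many pixels differ between two PPM images of equal size."""
    wa, ha, pa = read_ppm(image_a)
    wb, hb, pb = read_ppm(image_b)
    if (wa, ha) != (wb, hb):
        raise ValueError("image sizes differ")
    return sum(1 for i in range(0, len(pa), 3) if pa[i:i + 3] != pb[i:i + 3])
```

A result of zero means the two renderings are pixel-identical; anything else marks the testcase as changed and worth a look.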
I initially tried to do this using Xvfb to create a virtual screen and taking screenshots of it (both using ImageMagick 'import' and reading the Xvfb virtual screen directly through its exported framebuffer). I had horrible problems with synchronization, trying to ensure that the Xvfb server had actually rendered all the commands Gecko sent to it before I took the screenshots, and eventually gave up on that approach late last week.
Instead I just added a real offscreen rendering API to the view manager and added code to nsDocumentViewer to call it when the environment variable MOZ_FORCE_PAINT_AFTER_ONLOAD is set. With this patch, if you set MOZ_FORCE_PAINT_AFTER_ONLOAD=/tmp/foo, then every time we load a document Gecko spits out a message "GECKO: PAINT FORCED AFTER ONLOAD [url] [file] ([status])", where [url] is the loaded URL, [file] is of the form /tmp/foo-NNN and names a file in PPM format, and [status] is OK if we wrote a file or otherwise some barely-descriptive error token. This code should be useful for various hacks, and hopefully we'll be able to make the view manager API visible to JavaScript authors for use in the browser UI and extensions.
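A script driving the tests has to pick these messages out of Gecko's output. The following Python sketch shows one way to do that, based only on the message shape quoted above. It assumes the three fields are separated by single spaces and that the URL and file name contain no spaces; the function name `parse_paint_line` and the example values are hypothetical.

```python
import re

# Matches: GECKO: PAINT FORCED AFTER ONLOAD [url] [file] ([status])
# Assumption: url and file contain no whitespace.
PAINT_LINE = re.compile(
    r"^GECKO: PAINT FORCED AFTER ONLOAD (\S+) (\S+) \((.*)\)$"
)


def parse_paint_line(line):
    """Return a dict with url, file and status, or None for unrelated lines."""
    match = PAINT_LINE.match(line.strip())
    if match is None:
        return None
    url, path, status = match.groups()
    return {"url": url, "file": path, "status": status}
```

For example, a line such as `GECKO: PAINT FORCED AFTER ONLOAD file:///tests/a.html /tmp/foo-001 (OK)` would yield the URL, the image path `/tmp/foo-001`, and the status `OK`; any status other than `OK` means no image was written for that document.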
That solved a lot of problems, but I had another set of problems to grapple with this week: nondeterminism. Basically the regression test approach depends on the fact that every time Gecko renders a page in a window of fixed width and height, we get exactly the same image, down to the details of each pixel. But with trunk code we don't, and I just spent a few days figuring out exactly why. It boils down to the following issues:
Dealing with nondeterministic program behaviour is always troublesome, and tracking these issues down was really hellish. I had to instrument the code involved with logging functions, run my 1300 testcases multiple times until I saw two runs with different behaviours, compare the logs to narrow down the cause of the variation, and repeat until I had located the problem. And the output of 1300 testcases with a lot of logging enabled is very unwieldy to analyze.
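The first step of that loop, spotting which testcases rendered differently between runs of the same build, can be sketched in a few lines of Python. This is an illustration only, not the script used here: it assumes each run has already been collected into a dict mapping a testcase name to the bytes of its rendered image, and the name `unstable_testcases` is invented for the example.

```python
import hashlib


def unstable_testcases(runs):
    """Return the sorted names of testcases whose image differs between runs.

    `runs` is a list of dicts, one per run, mapping testcase name -> image bytes.
    """
    digests = {}
    for run in runs:
        for name, image in run.items():
            # Hashing keeps memory use low when there are many large images.
            digests.setdefault(name, set()).add(hashlib.sha256(image).hexdigest())
    return sorted(name for name, seen in digests.items() if len(seen) > 1)
```

Any name this returns identifies a testcase whose logs from the differing runs are worth comparing.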
Anyway, after fixing all these issues I now have 100% reproducible rendering for the 1300 testcases in the Mozilla tree. Next week I'll try the 2500 testcases in various online test suites. Hopefully they won't uncover any new issues. Then I'll get this work submitted in Bugzilla and get back to fixing bugs with the help of my shiny new regression test engine.
Currently my script takes about 130 seconds to run 1300 testcases on my 3GHz Xeon dedicated test machine, or about 10 tests per second. I have included support for distributing tests across multiple processors and I'm looking forward to seeing how many tests per second I get on my big machine.