We already have regression tests based on analyzing the frame tree geometry computed by the layout engine. Visual regression tests have some advantages over comparing those coordinates:
- Visual tests eliminate false positives: changes in the frame tree that don't affect rendering aren't reported as failures.
- Visual tests eliminate false negatives: they detect changes in rendering caused by z-ordering, image decoding, graphics layer changes, clipping, and so on, not just layout.
- Visual test results are easier to understand because we can show visually the difference between two runs. Anyone with HTML+CSS knowledge can run visual tests and analyze the output.
- These visual regression tests can run against an optimized build, whereas our frame tree tests require a debug build, so the visual tests run 2-3 times faster. Also, you'll be able to use standard Mozilla nightly builds to generate a baseline, if you want to.
I initially tried to do this with Xvfb: create a virtual screen and take screenshots of it (both using ImageMagick 'import' and by reading the Xvfb virtual screen directly through its exported framebuffer). I had horrible synchronization problems trying to ensure that the Xvfb server had actually rendered all the commands Gecko sent to it before I took the screenshots, and late last week I eventually gave up on that approach.
Moving away from Xvfb solved a lot of problems, but I had another set of problems to grapple with this week: nondeterminism. The regression test approach depends on the fact that every time Gecko renders a page in a window of fixed width and height, we get exactly the same image, down to the details of each pixel. But with trunk code we don't, and I just spent a few days figuring out exactly why. It boils down to the following issues:
- Animated images :-). These have to be turned off via a preference in the profile.
- I use Gtk2+Xft builds, and the Freetype2 autohinter seems nondeterministic; it sometimes produces different glyphs for the same character in the same font. Turning off autohinting and antialiasing via ~/.fonts.conf seems to fix the problem.
- A bug in painting table collapsing-borders where the BCData::mCornerBevel bit is uninitialized, making Gecko occasionally think that a border should be bevelled when it shouldn't and throwing the painting out of alignment. (I also found that border-collapse painting can vary depending on the damage rect involved, but that's not affecting me right now.)
- A bug in nsImageGTK::UpdateCachedImage where it copies image pixels into the image pixmap using gdk_draw_rgb_image_dithalign and specifies a "dither alignment" of rect->x, rect->y. It turns out that GTK already adds alignment to rect->x, rect->y, so the coordinates end up double-counted and the alignment is incorrect. This means that when an image is filled in in one shot we get one dithering, but when it's filled in over multiple passes (e.g. when data loads slowly from the network), the later sections are dithered slightly differently to the one-shot case. So on <24-bit displays you get slight variations in colouration in some rendered images (not visible to the naked eye), but only when the network is slow.
- Freetype2 has a bug where if you ask it to resize a 13x13 face to a 12.5x12.5 face, it decides that they're close enough already and does nothing, even though creating faces from scratch for 13x13 and 12.5x12.5 would give you slightly different font metrics. Because Xft caches faces internally, this means that when you're working with fractional font sizes, Xft will sometimes give you varying font metrics depending on whether it's reusing a cached face (and hitting the Freetype2 bug when it tries to resize the face) or creating one from scratch. This in turn occasionally causes line heights, and sometimes the intrinsic width of text inputs (which is computed from the font's maxAdvance), to be off by one pixel.
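As an aside, the ~/.fonts.conf change for disabling autohinting and antialiasing can look roughly like this (a sketch; check your fontconfig version's documentation for the exact property names it supports):

```xml
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <!-- Force deterministic glyph rasterization for the test machine -->
  <match target="font">
    <edit name="autohint" mode="assign"><bool>false</bool></edit>
    <edit name="antialias" mode="assign"><bool>false</bool></edit>
  </match>
</fontconfig>
```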
Dealing with nondeterministic program behaviour is always troublesome, and tracking these down was really hellish. I had to instrument the code involved with logging, run my 1300 testcases multiple times until I saw two runs with different behaviour, compare the logs to narrow down the cause of the variation, and repeat until I had located the problem. And 1300 testcases with a lot of logging enabled produce very unwieldy output to analyze.
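That loop boils down to something like this sketch. Everything here is a stand-in for the real harness: `run_tests` is a deliberately nondeterministic stub so the loop has something to find, and the real logs were of course far richer than one line per test.

```python
import random

def run_tests(seed):
    # Stand-in for the real harness: the real version ran the 1300
    # testcases with logging instrumentation enabled. Here one "test"
    # misbehaves at random, simulating e.g. autohinter variation.
    rng = random.Random(seed)
    flaky = rng.random() < 0.5
    return [f"test{i}: metric={1 if (i == 7 and flaky) else 0}"
            for i in range(20)]

def find_divergence(max_runs=100):
    # Re-run until some run's log differs from the baseline, then
    # report exactly which lines changed; that difference is the
    # starting point for locating the nondeterminism.
    baseline = run_tests(seed=0)
    for run in range(1, max_runs):
        log = run_tests(seed=run)
        diffs = [(a, b) for a, b in zip(baseline, log) if a != b]
        if diffs:
            return run, diffs
    return None
```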
Anyway, after fixing all these issues I now have 100% reproducible rendering for the 1300 testcases in the Mozilla tree. Next week I'll try the 2500 testcases in various online test suites. Hopefully they won't uncover any new issues. Then I'll get this work submitted in Bugzilla and get back to fixing bugs with the help of my shiny new regression test engine.
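With rendering fully reproducible, comparing a run against a baseline can be as blunt as a byte-for-byte comparison of the snapshot files. A minimal sketch (the directory layout and file naming here are hypothetical, not what my script actually uses):

```python
import hashlib
from pathlib import Path

def snapshot_digest(path):
    # Any single-pixel difference changes the digest, which is exactly
    # the "identical down to each pixel" criterion the tests rely on.
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()

def compare_run(baseline_dir, run_dir, names):
    # Return the testcases whose snapshots differ from the baseline.
    return [n for n in names
            if snapshot_digest(Path(baseline_dir) / n)
               != snapshot_digest(Path(run_dir) / n)]
```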
Currently my script takes about 130 seconds to run 1300 testcases on my 3GHz Xeon dedicated test machine. I have included support for distributing tests across multiple processors and I'm looking forward to seeing how many tests per second I get on my big machine.
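In spirit, the distribution support is just "divide the testcase list among workers". A sketch of the idea (details hypothetical; the worker here is a stub for loading a testcase, snapshotting the window, and comparing against the baseline):

```python
from multiprocessing import Pool

def run_testcase(name):
    # Stand-in: the real worker renders the testcase in Gecko,
    # snapshots the window, and compares against the baseline image.
    return name, "pass"

def run_all(testcases, workers=None):
    # Pool defaults to one worker per CPU, so tests-per-second should
    # scale with the number of processors available.
    with Pool(processes=workers) as pool:
        return dict(pool.map(run_testcase, testcases))

if __name__ == "__main__":
    results = run_all([f"case{i:04}.html" for i in range(1300)], workers=2)
```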