Friday 13 November 2009
Sword +5 Against Orange
Today for the first time I seriously used VMWare's record-and-replay feature for debugging. Chris Pearce had set up the mochitest harness to run test_access_controls.html over and over until it failed, to try to catch the "random orange" where the test has been intermittently failing on Tinderbox. He caught a failure after about 30 minutes of continuous test running in a recording VM a couple of days ago. (Each test run takes about 15 seconds, so this represents over a hundred test runs before we reproduced the failure.) Today I got around to debugging it.
Chris had captured an output log and had added some code to nsGlobalWindow::Dump to number the dump() messages. Then we can set a conditional breakpoint in nsGlobalWindow::Dump to stop whenever a particular message is about to be printed. A sensible thing to do is to break when we output the last "TEST PASSED" message before the first test failure. It took almost an hour to replay up to that point (replay is a bit slower than recording), so another sensible thing to do is to then immediately take a VM snapshot so future debugging can resume from that point relatively quickly as often as you like. It takes about two minutes to resume replay from a snapshot.
From then on it was fairly similar to a normal debugging session: setting breakpoints, looking at variables, rerunning the program as you work backwards towards the cause. Except that you don't have to worry about the execution being different and the bug not showing up. You know exactly what the log output is going to be, right down to the addresses of objects dumped in the log, so it's easy to set conditional breakpoints to stop on log events you want to look into.
The whole experience is pretty much what I expected: fabulous! It would be nice if the couple of minutes to restart execution from a snapshot could be reduced, and some of the other UI operations feel sluggish and heavyweight, but this was certainly by far the best way to debug what turned out to be a complex and very hard to reproduce failure. It's a good feeling to know that whatever happens, you will be able to go back over this execution again; it takes the fear out of debugging, and replaces it with confidence that you *will* be able to find the bug. My hat's off to E Lewis and the rest of the VMWare team.
We need to do a bit more work to optimize and document our setup, get some more experience with these features and perhaps get some more automation for running tests, catching failures and building that vital initial snapshot. But I'm pretty confident that soon this will be an essential tool for us.
Comments
Gerv
I'll look into VMWare for that form of debugging for my own work (Hmmm, does it work in Fusion? I'll have to find out...)
I'm still hurting from the lack of Valgrind on Mac 10.6 in the meantime, but this will give me something to look at in the meantime!
Mook: you might be able to do something from the guest, but it'd be tricky because you don't want the guest to issue commands during recording and have those same commands issued during replay...
It seems like VMWare could supply some guest tools to let you do that from within the guest OS in record mode, but have it be harmless when running in replay mode.
One way that's pretty easy is to tee the test log to a file on the host file system mounted in the guest and have a host process stop recording when we see a test failure in that log file. During replay the guest is not allowed to affect the outside world so that file doesn't change.