Saturday, 14 November 2009

Sword +5 Against Orange

Today for the first time I seriously used VMWare's record-and-replay feature for debugging. Chris Pearce had set up the mochitest harness to run test_access_controls.html over and over until it failed, to try to catch the "random orange" where the test has been intermittently failing on Tinderbox. He caught a failure after about 30 minutes of continuous test running in a recording VM a couple of days ago. (Each test run takes about 15 seconds, so this represents over a hundred test runs before we reproduced the failure.) Today I got around to debugging it.

Chris had captured an output log and had added some code to nsGlobalWindow::Dump to number the dump() messages. Then we can set a conditional breakpoint in nsGlobalWindow::Dump to stop whenever a particular message is about to be printed. A sensible thing to do is to break when we output the last "TEST PASSED" message before the first test failure. It took almost an hour to replay up to that point (replay is a bit slower than recording), so another sensible thing to do is to then immediately take a VM snapshot so future debugging can resume from that point relatively quickly as often as you like. It takes about two minutes to resume replay from a snapshot.

From then on it was fairly similar to a normal debugging session: setting breakpoints, looking at variables, rerunning the program as you work backwards towards the cause. Except that you don't have to worry about the execution being different and the bug not showing up. You know exactly what the log output is going to be, right down to the addresses of objects dumped in the log, so it's easy to set conditional breakpoints to stop on log events you want to look into.

The whole experience is pretty much what I expected: fabulous! It would be nice if the couple of minutes to restart execution from a snapshot could be reduced, and some of the other UI operations feel sluggish and heavyweight, but this was certainly by far the best way to debug what turned out to be a complex and very hard to reproduce failure. It's a good feeling to know that whatever happens, you will be able to go back over this execution again; it takes the fear out of debugging, and replaces it with confidence that you *will* be able to find the bug. My hat's off to E Lewis and the rest of the VMWare team.

We need to do a bit more work to optimize and document our setup, get some more experience with these features and perhaps get some more automation for running tests, catching failures and building that vital initial snapshot. But I'm pretty confident that soon this will be an essential tool for us.



9 comments:

  1. So what was the bug? :-)
    Gerv

    ReplyDelete
  2. what's the bug number of the orange you were hunting?

    ReplyDelete
  3. So where was the bug and why did it happen so rarely?

    ReplyDelete
  4. Robert O'Callahan14 November 2009 22:22

    526323.

    ReplyDelete
  5. Doesn't VMware have some sort of talk-to-the-host thing? Just wondering if, instead of the VM looping, it would make sense to issue a command from inside that restored the snapshot to before the run instead (so that you'd never have more than one run to wait through).

    ReplyDelete
  6. Nice work and nice technique!
    I'll look into VMWare for that form of debugging for my own work (Hmmm, does it work in Fusion? I'll have to find out...)
    I'm still hurting from the lack of Valgrind on Mac 10.6 in the meantime, but this will give me something to look at in the meantime!

    ReplyDelete
  7. Robert O'Callahan15 November 2009 10:13

    Edouard: it doesn't work in Fusion, not yet anyway.
    Mook: you might be able to do something from the guest, but it'd be tricky because you don't want the guest to issue commands during recording and have those same commands issued during replay...

    ReplyDelete
  8. So how did you notice that the orange had happened, just manual monitoring? It would be interesting, as Mook said, to figure out some way to be smarter about the record/replay times. Perhaps run buildbot or something like it on the host machine, with a slave in the guest machine, and roll back the guest's snapshot before every run? Then if the test fails, you can save another snapshot from the host machine.
    It seems like VMWare could supply some guest tools to let you do that from within the guest OS in record mode, but have it be harmless when running in replay mode.

    ReplyDelete
  9. Robert O'Callahan16 November 2009 22:56

    Yeah, Chris P just stopped the recording manually.
    One way that's pretty easy is to tee the test log to a file on the host file system mounted in the guest and have a host process stop recording when we see a test failure in that log file. During replay the guest is not allowed to affect the outside world so that file doesn't change.

    ReplyDelete