Sunday, 23 December 2007

Drive-By Debugging

My trunk dogfood build was feeling sluggish yesterday so I thought I should have a poke to see why. I'd noticed sluggish episodes on and off over the last week or two so I thought there might be some kind of problem. "top" showed firefox-bin using 75% CPU even when I wasn't doing anything in the browser, and there were no animations or other obvious activity going on in my windows, so something was definitely wrong.

Fortunately I always dogfood debug builds so it was just a matter of attaching gdb... The first thing I did was poor-man's profiling, just hit ctrl-C to interrupt the process several times and look at the stacks. It was quickly obvious that we were constantly in painting code. I fired up QuartzDebug and confirmed that my GMail window was being fully repainted many times a second.

The next step was to figure out what was triggering the painting. I set a breakpoint on nsIFrame::Invalidate, which is what most Gecko code goes through to force part of a window to be eventually repainted. That breakpoint wasn't being hit very often, so it wasn't related to the problem. I then looked for lower-level Cocoa APIs that could be being used to force the window to repaint, and discovered nsChildView::Invalidate calling those APIs. So I set a breakpoint in nsChildView::Invalidate. Bingo! That was being called very frequently, invalidating a 2000x2000 pixel area of the window.

The stack for that call showed the whole story; you can see it here. Basically the window was being invalidated by a Flash plugin instance requesting a repaint. The Flash plugin was responding to an "idle" event which we send to the plugin at a fairly high rate; each one of these events was causing the plugin to request a repaint of the window. I'm not really sure of the purpose of these events, but it's part of the NPAPI plugin API on Mac that we send them frequently. I'm even more unsure why Flash was requesting repaints of the whole window all the time; it was easy to see that the plugin's layout frame was only 100x100 pixels, and furthermore, the plugin was hidden. I think GMail only uses Flash to play sounds. I suspect there's a deeper bug here somewhere in the intersection of Flash, Gecko and GMail. Eep. It could be related to Quartz drawing model support that we added in the last year that has some other known nasty bugs (whether on our side or the Flash side, again I'm not sure).

Anyway if the plugin is hidden, we should be able to ignore its invalidate requests, working around this bug and maybe other addressing other performance issues. It turns out that the invalidate requests pass through nsPluginInstanceOwner, which has a mWidgetVisible field which is false in this case because we know the plugin isn't visible. It's a one-line patch to drop invalidation requests on the floor when this field is false, and that's what we're going to do. I couldn't be sure of reproducing the bug in a fresh Gecko build but I used a gdb conditional breakpoint with an associated command to simulate the fix in my running build, and it worked.

Episodes of unexplained sluggishness that are hard to reproduce are very frightening from my point of view, because they can really hurt the user experience but it's very hard to track them down; they don't leave stack traces and often there are no steps to reproduce so it's hard to write up a useful bug report. The moral of my story, I suppose, is to dogfood a debug build and if it seems to suddenly be afflicted by a problem like this, be willing to attach a debugger, collect some stacks and do whatever else is necessary to track down the problem ... because you may be our best chance to find and fix the bug before millions of users hit it.



9 comments:

  1. Dude! I have a co-worker who can make a computer stop working just by walking past it.
    I don't know what that's called, but you're the exact opposite of her :)

    ReplyDelete
  2. Cool! This issue was actually already reported in bug 407456. Any tips on how we could have helped gotten that bug noticed?

    ReplyDelete
  3. Robert O'Callahan23 December 2007 20:53

    Hi Manu! The first step to getting a bug noticed is to nominate it for blocking status. The other thing in this case would have been to reassign it to a Core component when Gecko developers are vastly more likely to be watching.

    ReplyDelete
  4. In addition to what roc said here and in the bug, any time there is a problem that deals with something in a web page, it belongs in a Core component instead of an application (Firefox, Thunderbird, SeaMonkey, Camino, etc.) component. This is doubly true for any such bug that is filed in an application’s “General” component; the “General” components often are components where bugs go to get lost ;)

    ReplyDelete
  5. Soon enough I think we'll have some kind of symbol server for OSX builds, so that you can debug/profile all nightlies -- the ability to do this on win32 is pretty fantastic.

    ReplyDelete
  6. Why didn't you use Shark?

    ReplyDelete
  7. Robert O'Callahan24 December 2007 22:16

    Bob: Shark's a great tool but for really obvious performance issues ("why am I doing something instead of nothing") gdb is good enough, and gdb offers the normal debugging features as well like showing the values of local variables, setting breakpoints, etc.

    ReplyDelete
  8. Can you really stop sending idle events to flash plugins? I thought that people had become so upset with Safari slowing flash down when windows were in the background (perhaps having a video playing while typing something elsewhere?) that Safari had to send them at the same rate regardless: http://webkit.org/blog/96/background-music/ . Perhaps that doesn't apply if the plugin has been scrolled off screen though....

    ReplyDelete
  9. Robert O'Callahan30 January 2008 18:22

    We haven't changed the idle-event-sending behaviour.

    ReplyDelete