For several months I've noticed that when my Firefox session gets bloated, about:memory shows zombie compartments associated with the nytimes.com site. I don't visit that site very often, but over a week or three a small subset of the pages that I visit there are leaked. I worried about it but didn't know how to track it down. It's hard to debug problems that arise over a period of weeks. (It's a testament to the reliability of our nightly builds though!)
Yesterday and today I discovered that by clicking around a large set of nytimes.com pages I could reproduce the problem in a reasonably short amount of time with a fresh profile and a small set of tabs open. Furthermore, with about:ccdump and some help from Olli Pettay, I was able to identify the particular element which forced the cycle collector to keep everything else alive. The next step was to figure out where we added references to that element that the cycle collector didn't know about. It's a hard debugging problem because we don't know which element is the problematic element until long after the references have been added
I tried a few approaches but tonight I finally found a successful one. I made a Windows full memory dump of the errant Firefox process, and attached to it with WinDbg. In the still-running Firefox process I then used about:ccdump to identify the address of the leaking DOM object. I was then able to use WinDbg's "s" command to search the entire memory space for occurrences of that address. Then I had to identify each occurrence. For the references the cycle collector knows about, this is easy: about:ccdump reveals the addresses of the referencing objects so you know those memory areas will contain the address of the leaking object. For other occurrences I had to dig around in nearby memory to figure out what sort of object contained the occurrence. Sometimes it's easy because there's a nearby vtable pointer and WinDbg can tell you which class the vtable pointer belongs to (using the "ln" command). Other cases were trickier. For example I found a reference to the leaking object next to a reference to an XPCNativeWrapper, in what looked like an array of similar pairs, and guessed that this was part of the hash table that maps DOM objects to their JS wrappers.
Anyway, I finally identified two references to the leaking DOM object from inside an nsFrameSelection object. These references were buried inside a copy of an nsMouseEvent that nsFrameSelection keeps for obscure reasons --- and did not tell the cycle collector about. Having identified the problem, creating a fix was easy since we don't really need to keep that copy around at all. Hopefully it'll land soon.
WinDbg has awful usability but is quite powerful and can do useful things that Visual Studio's debugger can't. I'm a bit fond of it as the distant descendant of DEBUG.COM and SYMDEB.EXE, which I spent a lot of time with while reverse-engineering and patching MS-DOS binaries (i.e., stripping copy protection from games) in my misspent youth. Too bad the syntax hasn't improved over the years!