Sunday 15 November 2015
Last week I looked into a couple of Gecko leaks using rr. One fun aspect of using rr is discovering new debugging techniques. "What's the best way to use a record-and-replay debugger to track down memory leaks?" is a very interesting question. Maybe no-one has tried to do this before.
XPCOM_MEM_LEAK_LOG uses instrumentation to count object constructors and destructors and report any mismatches. That tells us immediately which classes of objects leaked. My next step was to identify the address of a leaked object, preferably one that is the root of an ownership tree of leaked objects. In my particular case I chose to investigate a leaked nsWindow, because only one was leaked and I guessed (correctly, as it turns out) that it was most likely to be the root of the leaked object tree.
Finding the address of the leaked nsWindow was easy: run the program with breakpoints on the nsWindow constructor and destructor, set to log the object address and continue. The constructor with no matching destructor has the address of the leaked object. At process exit I could inspect the window object and see that indeed it still existed with refcount 1.
My next move was to use a similar technique to log addref/release operations on that window. This is similar to what you'd get with Gecko's built-in trace-refcount instrumentation, but produced a lot of data that was hard to analyze. I realized the leaked reference to the window probably corresponds to a pointer to the window in memory when the process exits. So I added a real-tid command to rr to print the real thread-id of a replaying process, and wrote a small program to walk /proc/.../maps and /proc/.../mem to search all process memory for the pointer value. This worked very well and showed just two addresses where the pointer occurred.
The next problem was to figure out what those addresses belong to. I futzed around a bit before finding the correct approach: set hardware write watchpoints on those addresses and reverse-execute from the end of the process to find the writes that stored those addresses. This told me immediately where and why each reference was being stored, and if it was a strong reference. This made the bug obvious.
A lot of this could be more automated. It would be particularly interesting to grab the XPCOM_MEM_LEAK_LOG output and process it to automatically output the "leak roots" --- leaking references that are not contained in objects that themselves leaked. If there are no leak roots then instead there must be at least one cycle of strong references, which we can output.