Monday, 30 November 2015

Even More rr Replay Performance Improvements!

While writing my last blog post I realized I should try to eliminate no-op reschedule events from rr traces. The patch turned out to be very easy, and the results are impressive:

Now replay is faster than recording in all the benchmarks, and for Mochitest is about as fast as normal execution. (As discussed in my previous post, this is probably because the replay excludes some code that runs during normal execution: the test harness and the HTTP server.) Hopefully this turns into real productivity gains for rr users.

Saturday, 28 November 2015

rr Replay Performance Improvements

I've been spending a lot of time using rr, as have some other Mozilla developers, and it occurred to me a small investment in speeding up the debugging experience could pay off in improved productivity quite quickly. Until recently no-one had ever really done any work to speed up replay, so there was some low-hanging fruit.

During recording we avoid trapping from tracees to the rr process for common syscalls (read, clock_gettime and the like) with an optimization we call "syscall buffering". The basic idea is that the tracee performs the syscall "untraced", we use a seccomp-bpf predicate to detect that the syscall should not cause a ptrace trap, and when the syscall completes the tracee copies its results to a log buffer. During replay we do not use seccomp-bpf; we were using PTRACE_SYSEMU to generate a ptrace trap for every syscall and then emulating the results of all syscalls from the rr process. The obvious major performance improvement is to avoid generating ptrace traps for buffered syscalls during replay, just as we do during recording.

This was tricky to do while preserving our desired invariants that control flow is identical between recording and replay, and data values (in application memory and registers) are identical at all times. For example consider the recvmsg system call, which takes an in/out msg parameter. During recording syscall wrappers in the tracee would copy msg to the syscall log buffer, perform the system call, then copy the data from the log buffer back to msg. Hitherto, during replay we would trap on the system call and copy the saved buffer contents for that system call to the tracee buffer, whereupon the tracee syscall wrappers would copy the data out to msg. To avoid trapping to rr for a sequence of such syscalls we need to copy the entire syscall log buffer to the tracee before replaying them, but then the syscall wrapper for recvmsg would overwrite the saved output when it copies msg to the buffer! I solved this, and some other related problems, by introducing a few functions that behave differently during recording and replay while preserving control flow and making sure that register values only diverge temporarily and only in a few registers. For this recvmsg case I introduced a function memcpy_input_parameter which behaves like memcpy during recording but is a noop during replay: it reads a global is_replay flag and then does a conditional move to set the source address to the destination address during replay.

Another interesting problem is recapturing control of the tracee after it has run a set of buffered syscalls. We need to trigger some kind of ptrace trap after reaching a certain point in the syscall log buffer, without altering the control flow of the tracee. I handled this by generating a large array of stub functions (each only one byte, a RET instruction) and after processing the log buffer entry ending at offset O, we call stub function number O/8 (each log record is at least 8 bytes long). rr identifies the last log entry after which it wants to stop the tracee, and sets a breakpoint at the appropriate stub function.

It took a few late nights and a couple of half-days of debugging but it works now and I landed it on master. (Though I expect there may be a few latent bugs to shake out.) The results are good:

This shows much improved replay overhead for Mochitest and Reftest, though not much improvement on Octane. Mochitest and Reftest are quite system-call intensive so our optimization gives big wins there. Mochitests spend a significant amount of time in the HTTP server, which is not recorded by rr, and therefore zero-overhead replay could actually run significantly faster than normal execution, so it's not surprising we're already getting close to parity there. Octane replay is dominated by SCHED context-switch events, each one of which we replay using relatively expensive trickery to context-switch at exactly the right moment.

For rr cognoscenti: as part of eliminating traps for replay of buffered syscalls, I also eliminated the traps for the ioctls that arm/disarm the deschedule-notification events. That was relatively easy (just replace those syscalls with noops during replay) and actually simplified code since we don't have to write those events to the trace and can wholly ignore them during replay.

There's definitely more that can be squeezed out of replay, and probably recording as well. E.g. currently we record a SCHED event every time we try to context-switch, even if we end up rescheduling the thread that was already running (which is common). We don't need to do that, and eliminating those events would reduce syscallbuf flushing and also the number of ptrace traps taken during replay. This should hugely benefit Octane. I'm trying to focus on easy rr improvements with big wins that are likely to pay off for Mozilla developers in the short term; it's difficult to know whether any given improvement is in that category, but I think SCHED elision during recording probably is. (We used to elide recorded SCHED events during replay, but that added significant complexity to reverse execution so I took it out.)

Monday, 16 November 2015

Debugging Leaks With rr

Last week I looked into a couple of Gecko leaks using rr. One fun aspect of using rr is discovering new debugging techniques. "What's the best way to use a record-and-replay debugger to track down memory leaks?" is a very interesting question. Maybe no-one has tried to do this before.

XPCOM_MEM_LEAK_LOG uses instrumentation to count object constructors and destructors and report any mismatches. That tells us immediately which classes of objects leaked. My next step was to identify the address of a leaked object, preferably one that is the root of an ownership tree of leaked objects. In my particular case I chose to investigate a leaked nsWindow, because only one was leaked and I guessed (correctly, as it turns out) that it was most likely to be the root of the leaked object tree.

Finding the address of the leaked nsWindow was easy: run the program with breakpoints on the nsWindow constructor and destructor, set to log the object address and continue. The constructor with no matching destructor has the address of the leaked object. At process exit I could inspect the window object and see that indeed it still existed with refcount 1.

My next move was to use a similar technique to log addref/release operations on that window. This is similar to what you'd get with Gecko's built-in trace-refcount instrumentation, but produced a lot of data that was hard to analyze. I realized the leaked reference to the window probably corresponds to a pointer to the window in memory when the process exits. So I added a real-tid command to rr to print the real thread-id of a replaying process, and wrote a small program to walk /proc/.../maps and /proc/.../mem to search all process memory for the pointer value. This worked very well and showed just two addresses where the pointer occurred.

The next problem was to figure out what those addresses belong to. I futzed around a bit before finding the correct approach: set hardware write watchpoints on those addresses and reverse-execute from the end of the process to find the writes that stored those addresses. This told me immediately where and why each reference was being stored, and if it was a strong reference. This made the bug obvious.

A lot of this could be more automated. It would be particularly interesting to grab the XPCOM_MEM_LEAK_LOG output and process it to automatically output the "leak roots" --- leaking references that are not contained in objects that themselves leaked. If there are no leak roots then instead there must be at least one cycle of strong references, which we can output.

TPPA Protest

On Saturday I participated in a protest march for the first time ever --- against the TPPA. I support lowering trade barriers, but the TPPA has a lot of baggage that I strongly dislike, such committing member nations to a dysfunctional United States-esque intellectual property regime, and sovereignty-eroding "investor-state dispute settlement". The biggest prize for New Zealand would have been free access to foreign dairy markets, but that was mostly not realized, so it seems like a bad deal for us.

Unsurprisingly there were a lot of different sorts of people involved. Many of them espoused themes I don't agree with --- knee-jerk anti-Americanism, "Socialist Aotearoa", general opposition to free-market economics. That made it more fun and interesting :-). I think it's very important that people who disagree about a lot of things can still work together on issues they do agree about.

Tuesday, 10 November 2015

Perfection In Imperfection

I like this quote from Ken Liu (translator of Cixin Liu's The Three-Body Problem into English):

The best translations into English do not, in fact, read as if they were originally written in English. The English words are arranged in such a way that the reader sees a glimpse of another culture's patterns of thinking, hears an echo of another language's rhythms and cadences, and feels a tremor of another people's gestures and movements.

This principle definitely applies to his translation of The Three-Body Problem. It contributed a lot to my enjoyment of the book. (Overall I thought it was good, though a little disappointing for a Hugo award winner.)

It occurred to me that I've never noticed this principle at work while reading the Bible in English. I think that's because the Bible is its own genre --- there's almost nothing written originally in English to compare it to. And then, anything remotely comparable originating in English would almost certainly been deeply influenced by the Bible.

Thursday, 5 November 2015

rr In VMWare: Solved!

Well, sort of.

I previously explained how rr running in a VMWare guest encounters a VMWare bug where performance counters measuring conditional branches executed fail to count a conditional branch executed between a pair of CPUID instructions. Today a helpful VWMare developer confirmed that this is an issue caused by clever optimizations in the virtual machine monitor. Better still, he provided an easy way to work around it! Just add

monitor_control.disable_hvsim_clusters = true
to your .vmx file. This may impact guest performance a little bit, but it will make rr work reliably in a VMWare guest, which should be useful to a lot of people. Excellent!

Monday, 2 November 2015

An Extraordinary Sunday

Sunday was extraordinary. I got up at 4am, took the ferry to Devonport, gathered with a few thousand runners to watch the All Blacks winning the Rugby World Cup final, ran the Auckland half marathon barefoot, had a quick shower at home, made it to church on time to worship the Creator of the universe, taught Sunday school (on the transfiguration of Jesus), ate tasty Vietnamese food at Hansan, took my son to his water polo game, then to his cousin's pool party, returned home to rewatch the final, cooked dinner, tried to figure out an rr bug, and collapsed into bed.

I'd hardly done any training for the half marathon, initially because I had had a sore knee that I was afraid of aggravating and later because I was too busy with tramping trips and other activities. Yesterday I had intended to just try to finish in under two hours, but I started a bit too quickly and by the time I realized it, there was nothing to be done but push on. I finished in 1:48, five minutes faster than my previous best (four net). I paid for it though; keeping going was mentally harder than anything I've done before.

It was a fantastic day altogether. The experience of being in Devonport at dawn watching the rugby before the run was something I'll always remember.