Saturday 28 November 2015
rr Replay Performance Improvements
I've been spending a lot of time using rr, as have some other Mozilla developers, and it occurred to me that a small investment in speeding up the debugging experience could pay off quite quickly in improved productivity. Until recently no-one had really done any work to speed up replay, so there was some low-hanging fruit.
During recording we avoid trapping from tracees to the rr process for common syscalls (read, clock_gettime and the like) with an optimization we call "syscall buffering". The basic idea is that the tracee performs the syscall "untraced", we use a seccomp-bpf predicate to detect that the syscall should not cause a ptrace trap, and when the syscall completes the tracee copies its results to a log buffer. During replay we do not use seccomp-bpf; we were using PTRACE_SYSEMU to generate a ptrace trap for every syscall and then emulating the results of all syscalls from the rr process. The obvious major performance improvement is to avoid generating ptrace traps for buffered syscalls during replay, just as we do during recording.
This was tricky to do while preserving our desired invariants: control flow is identical between recording and replay, and data values (in application memory and registers) are identical at all times. For example, consider the recvmsg system call, which takes an in/out msg parameter. During recording, syscall wrappers in the tracee would copy msg to the syscall log buffer, perform the system call, then copy the data from the log buffer back to msg. Hitherto, during replay we would trap on the system call and copy the saved buffer contents for that system call to the tracee buffer, whereupon the tracee syscall wrappers would copy the data out to msg. To avoid trapping to rr for a sequence of such syscalls we need to copy the entire syscall log buffer to the tracee before replaying them, but then the syscall wrapper for recvmsg would overwrite the saved output when it copies msg to the buffer!
I solved this, and some other related problems, by introducing a few functions that behave differently during recording and replay while preserving control flow and making sure that register values only diverge temporarily and only in a few registers. For this recvmsg case I introduced a function memcpy_input_parameter, which behaves like memcpy during recording but is a no-op during replay: it reads a global is_replay flag and then does a conditional move to set the source address to the destination address during replay.
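The idea behind memcpy_input_parameter can be sketched in plain C. This is only an illustration under stated assumptions, not rr's actual code: the flag is modelled as an ordinary global, the copy uses memmove so that the self-copy during replay is well defined, and a C compiler is free to turn the select into a branch, so real code that must keep control flow identical would presumably need assembly to guarantee a conditional move.

```c
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: a global flag that the supervisor sets to 1
   before replay starts. The name comes from the post; the type and storage
   are guesses. */
volatile int is_replay = 0;

/* Copies n bytes from src to dest while recording. During replay the source
   address is replaced by the destination address, so the copy still runs
   (same calls, same loop) but leaves dest unchanged. */
void *memcpy_input_parameter(void *dest, const void *src, size_t n) {
    /* Select the source without changing which code runs afterwards.
       A compiler may or may not emit a conditional move for this. */
    const void *effective_src = is_replay ? (const void *)dest : src;

    /* memmove rather than memcpy: copying a buffer onto itself is only
       well defined with memmove. */
    return memmove(dest, effective_src, n);
}
```

With the flag clear, the call copies src into dest as usual. With the flag set, whatever was already in dest (in the scenario above, the saved output in the log buffer) survives the call.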
Another interesting problem is recapturing control of the tracee after it has run a set of buffered syscalls. We need to trigger some kind of ptrace trap after reaching a certain point in the syscall log buffer, without altering the control flow of the tracee. I handled this by generating a large array of stub functions (each only one byte, a RET instruction). After processing the log buffer entry ending at offset O, we call stub function number O/8 (each log record is at least 8 bytes long). rr identifies the last log entry after which it wants to stop the tracee, and sets a breakpoint at the appropriate stub function.
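The mapping from log offsets to stubs is simple arithmetic, sketched here in C. The function and constant names are hypothetical, and the stub array is represented only by its base address. Because every record is at least 8 bytes long, the end offsets of two different records differ by at least 8, so integer division by 8 never sends two records to the same stub.

```c
#include <stddef.h>
#include <stdint.h>

/* From the post: each log record is at least 8 bytes long. */
enum { MIN_RECORD_SIZE = 8 };
/* From the post: each stub is a single one-byte RET instruction. */
enum { STUB_SIZE = 1 };

/* Index of the stub called after processing the record that ends at
   end_offset in the syscall log buffer. */
size_t stub_index_for_offset(size_t end_offset) {
    return end_offset / MIN_RECORD_SIZE;
}

/* Address where a breakpoint would go in order to stop the tracee right
   after that record. stub_array_base is a hypothetical parameter holding
   the address of stub number 0. */
uintptr_t stub_address(uintptr_t stub_array_base, size_t end_offset) {
    return stub_array_base + stub_index_for_offset(end_offset) * STUB_SIZE;
}
```

For example, a record ending at offset 24 maps to stub 3, so with the stubs starting at 0x1000 the breakpoint would sit at 0x1003.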
It took a few late nights and a couple of half-days of debugging, but it works now and I landed it on master. (Though I expect there may be a few latent bugs to shake out.) The results are good:
This shows much improved replay overhead for Mochitest and Reftest, though not much improvement on Octane. Mochitest and Reftest are quite system-call intensive so our optimization gives big wins there. Mochitests spend a significant amount of time in the HTTP server, which is not recorded by rr, and therefore zero-overhead replay could actually run significantly faster than normal execution, so it's not surprising we're already getting close to parity there. Octane replay is dominated by SCHED context-switch events, each one of which we replay using relatively expensive trickery to context-switch at exactly the right moment.
For rr cognoscenti: as part of eliminating traps for replay of buffered syscalls, I also eliminated the traps for the ioctls that arm/disarm the deschedule-notification events. That was relatively easy (just replace those syscalls with no-ops during replay) and actually simplified the code, since we don't have to write those events to the trace and can wholly ignore them during replay.
There's definitely more that can be squeezed out of replay, and probably recording as well. E.g. currently we record a SCHED event every time we try to context-switch, even if we end up rescheduling the thread that was already running (which is common). We don't need to do that, and eliminating those events would reduce syscallbuf flushing and also the number of ptrace traps taken during replay. This should hugely benefit Octane. I'm trying to focus on easy rr improvements with big wins that are likely to pay off for Mozilla developers in the short term; it's difficult to know whether any given improvement is in that category, but I think SCHED elision during recording probably is. (We used to elide recorded SCHED events during replay, but that added significant complexity to reverse execution so I took it out.)
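To make the proposed SCHED elision concrete, here is a toy model in C. It is not rr's scheduler: the function, its parameters, and the representation of scheduling decisions as an array of thread ids are all assumptions made for illustration. It counts how many SCHED events a sequence of scheduling decisions would produce, with and without eliding the cases where the same thread is picked again.

```c
#include <stddef.h>

/* decisions[i] is the thread id chosen at the i-th attempt to
   context-switch; initial_tid is the thread running before the first
   attempt. If elide_same_thread is nonzero, an attempt that reschedules
   the thread already running produces no SCHED event. */
size_t count_sched_events(const int *decisions, size_t n,
                          int initial_tid, int elide_same_thread) {
    size_t events = 0;
    int current = initial_tid;
    for (size_t i = 0; i < n; ++i) {
        int next = decisions[i];
        if (!elide_same_thread || next != current) {
            ++events;  /* this attempt would be written to the trace */
        }
        current = next;
    }
    return events;
}
```

For the decision sequence 1, 1, 2, 2, 2, 1 starting on thread 1, the model records six events without elision and two with it, which is the kind of reduction the paragraph above is hoping for when rescheduling the same thread is common.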