Monday 6 June 2016
Brendan Dolan-Gavitt asked me to write about how I debug rr itself. That's an interesting topic because it's a challenging problem; as I wrote a while ago, you sometimes feel you're in debugging hell so that others can go to heaven.
Brendan was talking about divergence bugs, i.e. bugs where replaying a recorded execution goes "off the rails" because there's some source of nondeterminism that differs between recording and replay. These can be further divided into recording bugs, where nondeterminism wasn't recorded correctly, and replay bugs, where there's enough information in the trace to replay but we replayed incorrectly.
rr saves the values of all general-purpose registers, the program counter, and the conditional branch counter, at most trace events, i.e. signals, context switches, and system calls that don't take the syscall-buffering fast path. We typically detect divergence bugs when replay reaches the execution point of the next event but the replay register/counter state doesn't match the recorded state. This includes cases where a different system call is performed, and most cases where behavior changes during replay produce writes to stdout/stderr that differ from those during recording. Here's an example:
[ERROR /home/kfischer/rr-vanilla/src/Registers.cc:303:maybe_print_reg_mismatch() errno: SUCCESS] rdx 0 != 0xff (replaying vs. recorded) [FATAL /home/kfischer/rr-vanilla/src/Registers.cc:423:compare_register_files() errno: SUCCESS] (task 24036 (rec:24003) at time 1132365) -> Assertion `!bail_error || match' failed to hold. Fatal register mismatch (ticks/rec:4473519809/4473519809) Launch gdb with gdb /home/kfischer/.local/share/rr/latest-trace/mmap_hardlink_23_wine64 and attach to the rr debug server with: target remote :24036
In rr, most divergences are detected pretty soon after they are caused. In practice almost any change in control flow causes the conditional-branch counter to diverge. So the first thing I do is look for clues that something unusual happened just before the divergence was detected. One way to do that is to look at the trace leading up to the current point. If a rare system call or signal happened just before we diverged, it's worth looking closely at our handling of that! Another way is to use the emergency debugger mentioned above. This provides a gdbserver interface so you can inspect the state of the diverged process with gdb. Looking at the call stack can show that the process was doing something unusual that needs to be checked, especially if you have application source. Surprisingly often, with this data, I'm able to make some inspired guesses and write a standalone testcase that reproduces the problem, and from there it's usually easy.
When we have a divergence in one or a few register values, such as the issue linked above, that almost always means control flow did not diverge and we can make progress by reasoning backwards to figure out where that bad value must have come from. As you'd hope, rr itself is very useful for this; you can replay right up to the divergence point and reverse-stepi to see where the bad value is set. If reverse-execution is busted for some reason, we can tell rr to singlestep over some region of the trace and dump all register values at each step --- slow, but usually gives you the information you need. These techniques don't work if control flow diverged.
For harder bugs we generally need to be able to reproduce the bug by re-recording instead of just working off a single recorded trace, if we're going to make any progress. Re-recording with the system-call buffering optimization disabled is very helpful; it's a lot slower, but we'll check registers/counters at every single system call and catch divergences earlier. If the bug disappears entirely, we can safely bet it's in the syscall-buffering logic itself and we can easily narrow down which syscall wrapper is responsible. A similar but more aggressive approach which I've used in some desperate situations is to set the scheduler to context-switch every small number of conditional branches, and suppress noop-reschedule elision, so the trace records the register state very frequently ... effective, but often impractically slow.
Another tool in our toolbox is memory checksumming. To catch bugs where a memory location diverges and we don't catch the divergence until much later, rr has command-line options to periodically checksum memory regions and include those checksums in the trace. During replay we check the checksums to see if memory has diverged. For when you're close to the target, you can also dump actual memory contents and compare them. I don't use this facility very much, but when you need it, you really need it.
For bugs where we know we've recorded the correct data but replay goes wrong, we have the ability to record rr replay under rr (and, obviously, replay it), so we can actually use reverse-execution debugging to debug rr's replay code. I haven't used this a lot yet, since I got it working relatively recently, and sometimes it falls over ("and then you have two problems"), but when it works you get the feeling of lightning shooting out of your fingertips.
One general rule that I learned at Mozilla is to be very aggressive about using assertions to check invariants. This helps detect bugs closer to the root cause. Generally rr has assertions enabled all the time; the performance impact is not usually an issue since rr's efficiency (apart from the syscallbuf code) mostly comes down to switching from the application to rr as infrequently as possible. If we're running rr code, we're already losing.
Some of the hardest divergence bugs we've had in rr have been misbehavior in the hardware conditional-branch counter, either induced by OS/VM issues or by actual hardware bugs. These are agony. Definitely study Intel's errata sheets, read kernel source, and test carefully across different CPU/VM configurations.
I hope this helps! I'm always glad to answer questions about rr :-).