Monday 13 July 2015
rr 4.0 will be the first rr release where reverse execution is reliable. I want to release it soon, but some users still find situations where reverse executions that seems to hang. I've implemented a mitigation by allowing ctrl-C to interrupt reverse execution (though I think there are still bugs where sometimes it fails to interrupt cleanly), but there are also underlying issues that can cause reverse execution to be very slow, and I want to fix as many of those as I can. I've recently fixed the two major fixable issues that I currently know of, and they're interesting.
For the following discussion, keep in mind that we implement reverse execution by restoring the program state to a previous checkpoint, then executing forwards and noting which breakpoints/watchpoints get hit. Once we reach our original starting point, we know which breakpoint/watchpoint firing is the "next" one in the reverse direction; we restore the program state to a previous checkpoint again, and execute forward again until we reach that target breakpoint/watchpoint.
The first issue involves reverse execution over conditional breakpoints/watchpoints that are hit frequently but whose condition almost always returns false. Following the above procedure, rr would reverse-continue to the most recent triggering of the breakpoint (or watchpoint), then stop and signal gdb; gdb would read program state via rr to evaluate the breakpoint, and then signal rr to continue with reverse execution. Thus, every time the breakpoint is hit, rr must perform at least two clonings of checkpoint state plus some amount of forward execution. That makes reverse execution unbearably slow if the breakpoint is hit hundreds of times.
Fortunately, gdb has an obscure feature that lets it offload evaluation of simple conditions to the remote target (in this case, rr) by expressing them as bytecode. So we add to rr a little interpreter for gdb's bytecode; then rr can implement reverse execution by executing forwards over some time interval, and every time a breakpoint is hit, evaluating its condition and ignoring the hit if the condition evaluates to false. If all goes well we only have to reconstruct program state for debugger stops where the user actually wanted to stop.
Unfortunately not all goes well. Some breakpoint expressions, e.g. those involving function calls, are too complex for gdb to express with its bytecode; in such cases, pathological slowdown of reverse execution is inevitable. So try to avoid using complex breakpoint conditions on breakpoints that are frequently hit during reverse execution. Moreover it turns out gdb has a bad bug in code generation, which requires an even worse workaround. I submitted a gdb fix which has landed upstream so hopefully one day we can get rid of the workaround.
Another case which is difficult to deal with is an unconditional breakpoint being hit very many times during the interval we're reverse-executing over. In this case just executing forward and stopping thousands times to find the last hit can be prohibitively expensive. Ideally we could turn off breakpoints, run forward until we're close to where reverse execution started, then enable breakpoints and continue forward again, hopefully only hitting a few occurrences of the breakpoint before we reach the last one. We can approximate this by detecting that we're hitting "too many" breakpoints while executing forward, and responding by cutting the remaining execution interval roughly in half, advancing to the start of the second half with breakpoints disabled, and resuming normal forward execution with breakpoints enabled --- possibly having to repeat the subdivision process if we're still in dense region of breakpoint hits.
I just landed these fixes on master. I'm very interested in hearing about any remaining cases where reverse execution takes a pathologically long time.