Thursday 31 March 2016
Using rr To Debug rr
Working on rr has often been frustrating because although rr is immensely useful for understanding bugs, it often has complex bugs of its own and you can't use rr to debug them. You sometimes feel you're in debugging hell so that others can go to heaven.
I realized it might not be too hard to get rr to record and replay rr doing a replay. rr's behavior during replay is much simpler than during recording, partly because most kernel state does not exist during replay (file descriptors aren't really open, etc). Achieving this could be valuable because most of the work I plan to do on rr involves doing more interesting things during replay. So I've spent about a week working on that --- and it works! On master you can run:
    rr record ~/rr/obj/bin/simple
    rr record rr replay -a ~/.local/share/rr/simple-0
    rr replay -a
The final line is rr replaying rr replaying simple. I've added a test for this.
It's about 2,000 lines of new code, much of which is tests exercising the new ptrace features I had to implement. Fleshing out support for ptrace (including PTRACE_SYSCALL, PTRACE_SYSEMU, PTRACE_SINGLESTEP, etc.) was the majority of the work. (A process can't have two ptracers at the same time, so the outer rr must ptrace all the processes, and the inner rr's use of ptrace has to be completely emulated by the outer rr.) Along the way I found and fixed some interesting existing rr bugs. In some cases I simplified rr replay to avoid ptrace features that it didn't really need (e.g. PTRACE_O_TRACEEXEC).
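To make the emulation idea a little more concrete, here is a toy Python sketch of the kind of bookkeeping an outer tracer might keep. It is not rr's code (rr is written in C++ and the real logic is far more involved). The class name EmulatedPtrace and all of its methods are hypothetical, invented purely for illustration. The sketch assumes only what is described above: each tracee has at most one tracer, and the emulated tracer learns about tracee stops through an emulated wait rather than from the kernel.

```python
from collections import deque


class EmulatedPtrace:
    """Toy model: ptrace relationships that exist only in the outer tracer's tables.

    All names here are hypothetical; this is not rr's implementation.
    """

    def __init__(self):
        self.tracer_of = {}  # tracee pid -> emulated tracer pid
        self.pending = {}    # emulated tracer pid -> queue of (tracee pid, reason)

    def attach(self, tracer, tracee):
        # Mirror the kernel rule: a process can have only one tracer.
        if tracee in self.tracer_of:
            raise PermissionError(f"pid {tracee} already has an emulated tracer")
        self.tracer_of[tracee] = tracer
        self.pending.setdefault(tracer, deque())

    def report_stop(self, tracee, reason):
        """Record a stop that the real (outer) tracer observed for tracee.

        Returns True if the stop was queued for an emulated tracer,
        False if tracee has no emulated tracer.
        """
        tracer = self.tracer_of.get(tracee)
        if tracer is None:
            return False
        self.pending[tracer].append((tracee, reason))
        return True

    def wait(self, tracer):
        """Emulated wait: return the next queued stop for tracer, or None."""
        queue = self.pending.get(tracer)
        return queue.popleft() if queue else None


if __name__ == "__main__":
    table = EmulatedPtrace()
    table.attach(tracer=100, tracee=200)     # inner rr "attaches" to its tracee
    table.report_stop(200, "syscall-entry")  # outer tracer sees the real stop
    print(table.wait(100))                   # inner rr's wait is answered from the table
```

The point of the sketch is only that the inner tracer never talks to the kernel's ptrace machinery directly: every attach, stop and wait is answered from state held by the outer tracer.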
Debugging rr running under rr can be mind-bending. It's easy to get confused about which rr you're looking at, and it's hard to examine the behavior of the inner rr when you can't trust the outer rr. I hope I don't have to do much more of it!
As you'd expect, adding a level of record and replay makes things much slower.
As you'd expect, I tried rr recording rr replaying rr replaying simple. As you'd expect, it didn't work first time. It's probably a fixable bug but I have no desire to go deeper down the rabbit-hole at the moment.
An obvious question is whether rr could record rr recording as well as rr replay. In principle I think that could work, but it would not be very useful. If rr recording of X fails, then rr recording rr recording X is likely to experience the same fault in the outer rr (because it must record the inner rr and X), and be much more difficult to debug! For debugging rr recording really well we'd need a record and replay system built on an entirely different technology, e.g. Simics.