One of the painful truths about Linux development is that debugging sucks. Specifically, gdb on Linux sucks. Basic functionality simply does not work reliably. The details of the brokenness vary depending on the version, but the general suckiness seems to hold up across versions. Since it fails on the basics, we need not even discuss the lack of modern features or its appalling usability. This is a big problem for Linux because it is driving developers away from the platform.
Here's a deeper and less widely understood truth: all debuggers suck. Debugging is mainly about tracing bug symptoms back to buggy code, i.e. tracing effects to causes. Effects follow their causes in time, so debugging by nature requires backwards traversal of time. But traditional debuggers don't provide this because it's not obvious how to implement it. Instead they let the programmer monitor a program's forward execution, in the hope that the programmer will be able to somehow remember or reconstruct enough history, usually by running the program several times, to figure out for themselves the backwards cause and effect chain. Such tools aren't particularly suited to the debugging task; they're just a compromise forced by either limited imagination or implementation constraints.
Lots of other people have made this observation and have tried to extend traditional debuggers with some support for looking backwards in time. For example:
- SoftICE supports a "back trace buffer" that records a limited number of instruction executions in a certain instruction address range. Only the program counter values are recorded.
- Green Hills' "Time Machine" debugger also has a trace buffer, which apparently can include register and memory deltas so that old memory and register values can be restored by "backwards stepping" through the buffer. Unfortunately it seems this buffer is limited to memory, and therefore for a large application only a small history window can be retained. Furthermore Green Hills' implementation requires specialized hardware if you want to trace more than program counter values.
- TTVM logs the interactions between a guest virtual machine and its host so that later the exact execution of the guest VM can be replayed. Periodic checkpoints are recorded so that the replay cost required to reach any desired point in time is limited.
In Time Machine and TTVM, the UI is a traditional debugger UI extended with "reverse" versions of the standard forward execution commands. The debugger still displays the program state at a certain point in time; there is now some capability to move the current point backwards as well as forwards.
This is good progress --- but those tools still need to overcome implementation limitations to become immediately useful for mainstream desktop application debugging (i.e. useful to me!). More importantly, I want more than just reverse execution. For example, when debugging we frequently want to know exactly when a variable was set to its current (incorrect) value, and see what the program was doing at that time. Reverse-execution debuggers can set a watchpoint on the variable and execute backwards to find that write, but it would be quicker and more powerful if the debugger just knew the history of the variable and could perform an efficient database lookup to find the last time the variable was written before the current time. The debugger should have the entire history of the program available for efficient inspection and querying. With such a debugger, after we've gathered the history we don't need to run the program any more; the user just browses the history in the debugger until the bug is located.
The only other exposition of this vision I know of is Omnisicient Debugging. Bil Lewis has created a Java debugger that takes this approach. Unfortunately it is restricted to Java programs, and apparently rather small and short-running Java programs since it keeps the trace data in memory. So the big question is: can we implement this vision so that it's a practical approach for debugging real applications on commodity hardware?
I have spent a few of the last 18 months investigating this question. I have built a system, which I'm calling Amber, to record the complete execution history of arbitrary Linux processes. The history is recorded using binary instrumentation based on Valgrind. The history is indexed to support efficient queries that debuggers need, and then compressed and written to disk in a format optimized for later query and retrieval. The history supports efficient reconstruction of the contents of any memory location or register at any point in time. It also supports efficient answers to "when was the last write to location X before time T", "when was location P executed between times T1 and T2", and other kinds of queries. I can record the 4.1 billion instructions of a Firefox debug build starting up, displaying a Web page, and exiting; the compressed, indexed trace is about 0.83 bytes per instruction executed. (With some cleverness, we achieve compression ratios of over 20 to 1.) It takes less than half an hour to record on my laptop. 300X slowdown may sound like a lot, but spending computer time to save programmer time is almost always a good idea. Furthermore Amber tracing easily parallelizes across multiple cores so hardware trends are in my favour.
I've built a trace query engine in C and a prototype debugger front end using XULRunner to demonstrate the power of this approach. It works as expected. I've found that basing the debugger entirely on a "complete recording" approach dramatically simplifies some of the difficult parts of debugger implementation. Novell has filed for a patent on my work (and is planning to file another regarding an extension which I haven't described here yet), but I hope to get Novell to release the source code under an open source license so I can keep working on it. Maybe others will be interested in working on it too.
When I get a chance I'll write more about how Amber works, its full set of features, and what needs to be done to get it to prime-time.