Monday 30 November 2020
rr remix: Efficient Replay-Only Binary Instrumentation
The Pernosco omniscient debugger analyzes user-submitted rr recordings to extract all program states into a database, so that during debugging sessions it can efficiently recreate the program state at any point in time. This analysis requires intensive binary instrumentation of rr replays (e.g. to observe all memory writes performed by the application). To integrate binary instrumentation with rr replay we had to create a new binary instrumentation framework, which we call "rr remix". (Record, replay, remix, ...)
Our main goals for remix were:
- Instrument the replay of any rr recording (including applications with self-modifying code)
- Support executing arbitrary instrumentation code
- Minimise space and time overhead, especially on huge applications (e.g. Web browsers)
- Be efficient on code without compiler optimizations, as well as optimized code (the former being common when debugging)
- Be simple so we could build and maintain it with minimal effort
- Support replaying rr recordings in hardware/VMs without access to hardware performance counters or CPUID faulting, and with incompatible XSAVE layouts
Traditionally, tool performance has been mostly evaluated on small, compiler-optimized benchmark applications. Our goals differ from that focus, which led us to make rather different design decisions compared to other well-known binary instrumentation tools such as Pin, Valgrind, and DynamoRIO. Doing binary instrumentation during rr replay imposes some extra requirements on the instrumentation engine, but it also gives us very valuable knowledge of the future, which we can exploit to simplify the engine's design and improve its performance.
To illustrate the performance of remix we will show the performance of clang++ compiling a simple C++ program. This resembles running a single test of a large application, typical behaviour for Pernosco users.
We compare up-to-date DynamoRIO, Valgrind and remix, all using "null tool" instrumentation, i.e. rewriting the code without adding any instrumentation beyond what the framework itself requires. The results are geometric means of five runs, measured in real time. On optimized clang++, remix of an rr replay beats both DynamoRIO and Valgrind instrumenting a normal run.
On non-optimized code, remix beats both DynamoRIO and Valgrind again, but the overhead ratios of all the instrumentation systems (especially DR/Valgrind) are lower. This is a pretty good result for remix because it is much simpler than DR and Valgrind; in particular it doesn't perform any inlining, while the others do. The lack of inlining hurts more on non-optimized code, which has a lot more function calls that compilers would normally inline. (Valgrind deliberately optimizes for ease of tool writing and portability over performance; I'm including it for completeness and because people are familiar with it.)
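For anyone wanting to run a similar comparison, here is a minimal sketch of how a geometric mean over several runs and an overhead ratio against an uninstrumented baseline could be computed. The function names and the timing values are placeholders made up for illustration; they are not measurements from this post, and this is not the script used to produce the results above.

```python
import math


def geometric_mean(samples):
    """Geometric mean of positive wall-clock timings (seconds)."""
    if not samples or any(s <= 0 for s in samples):
        raise ValueError("need at least one sample, all strictly positive")
    # Average the logs, then exponentiate; this avoids overflow from a raw product.
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


def overhead_ratio(instrumented, baseline):
    """Ratio of geometric-mean instrumented time to geometric-mean baseline time."""
    return geometric_mean(instrumented) / geometric_mean(baseline)


# Hypothetical timings for five runs each (placeholders, NOT real data).
baseline_runs = [1.0, 1.0, 1.0, 1.0, 1.0]
instrumented_runs = [2.0, 2.0, 2.0, 2.0, 2.0]

if __name__ == "__main__":
    print(f"overhead: {overhead_ratio(instrumented_runs, baseline_runs):.2f}x")
```

The geometric mean is a common choice when summarising ratios like these, since a single slow run skews it less than it would an arithmetic mean.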
Unfortunately we aren't in a position to open-source remix at this time.
We plan to follow up with some more posts documenting interesting design decisions in remix and how they contribute to these results. Probable topics:
- The basic remix architecture and how it integrates into rr
- Fixing regular rr's limitations on trace portability and target hardware
- Leveraging knowledge of the future to improve the efficiency of binary rewriting
- The mystery of efficient branch-and-link instructions on x86-64
- Optimizing non-optimized code: leveraging hardware return address prediction in binary instrumentation
- Optimizing non-optimized code: dataflow analysis
PS, remember this is all in service of the Pernosco omniscient debugger — try it out!