Wednesday 2 December 2020
When rr replays a recording, it constructs processes that have identical memory and register contents to the recorded processes. It replays execution of the threads in those processes in steps; each step runs CPU code until some specific program state is reached (e.g., the next system call, or until registers and the retired-conditional-branch counter match some recorded values). The most efficient way to implement binary instrumentation of this code is to inject the instrumentation engine into each replay process so the engine and its generated code share the same address space as the application code and data. Then, instead of rr replay using PTRACE_CONT to directly execute application code, it uses PTRACE_CONT to enter the instrumentation engine, which is then responsible for executing the application code with instrumentation. When the engine detects that the instrumented code has reached the desired stopping point for that replay step, it returns control to rr replay.
A key invariant is that the remix engine produces the same effects as native execution on application memory and registers at the end of each replay step. This ensures that rr replay continues to produce memory and register states that match those during recording. It also means we can switch between instrumented execution and native execution at any time between replay steps, e.g. we can replay up to a certain point using regular rr replay and then turn on instrumentation. This is useful for debugging the instrumentation engine and for applying instrumentation-based analysis to a subinterval of an rr recording. In particular Pernosco uses this to parallelize analysis by running multiple replays at once, each one instrumenting a different time interval in the recording.
When injecting the instrumentation engine into each replay process we need to allocate a contiguous range of virtual memory that will never be used by the application. Fortunately, because this is an rr replay, we can see into the future. We can quickly scan the recording, identify a sufficiently large range of memory that will never be used in any replay process, and place the engine and its data there in all replay processes from the beginning.
rr replay needs to count the number of retired conditional branches so that we can deliver asynchronous interrupts at the right time during program execution. Effectively, the RCB counter is part of the state that we instruct the engine to stop at. To avoid having to stop and start a hardware performance counter around the instrumentation's own conditional branches, the engine disregards hardware counters and instead adds instructions to count the conditional branches explicitly.
Like other instrumentation engines, remix processes a group of instructions at a time, translating each group of application instructions into instrumented instructions in a hidden code buffer; we call these groups "fragments". This lets us apply optimizations across instruction boundaries within a fragment. For example, the shortest instruction to increment our conditional branch counter is a single inc instruction, but this instruction modifies the CPU's arithmetic flags, which could disrupt the application. However, it is safe to modify flags if we can guarantee that the inc will always be followed by an application instruction that overwrites those flags without reading them. Conditional branches are often preceded by such instructions. Therefore, for example, consider the following application instructions:
cmp r12,[rsp-8] jz labelBecause cmp overwrites the arithmetic flags, we can translate this code to
inc [remix_rcb_counter] cmp r12,[rsp-8] jz translated_label
Correctly applying this kind of optimization in a binary instrumentation engine is more difficult than it looks, because of unexpected early exits from fragments. In this case, a problem would arise if rsp-8 is not a valid address so that instruction triggers a segmentation fault. We would fault after incrementing remix_rcb_counter, counting a conditional branch that may never happen. Even worse, we will have corrupted the application flag values; the cmp instruction we were counting on to cover that up has not executed, and may never execute! (Keep in mind that segfaults don't have to be fatal...) Normally, the possibility of these unexpected exits limits the optimizations an engine can use, and/or requires elaborate recovery machinery to mitigate — machinery that adds overhead to the just-in-time binary instrumentation process.
During remix execution, however, rr replay indicates whether execution will stop at a segmentation fault or not. If it will stop at a fault, the fault state will be the goal state for that execution step, and remix will insert code before the faulting instruction to stop execution when the right state has been reached — it will never execute a faulting instruction. In remix there are no early exits from fragments; once a fragment has been entered, it is guaranteed to run to completion. We can apply code motion within a fragment at will as long as dataflow dependencies are respected. Binary instrumentation of rr replays is a much easier problem than regular just-in-time binary instrumentation and this lets remix achieve lower overhead with a simpler implementation.