Monday, 29 September 2014

Upcoming rr Talk

Currently I'm in the middle of a 3-week visit to North America. Last week I was at a Mozilla graphics-team work week in Toronto. This week I'm mostly on vacation, but I'm scheduled to give a talk at MIT this Thursday about rr. This is a talk about the design of rr and how it compares to other approaches. I'll make the content of that talk available on the Web in some form as well. Next week I'm also mostly on vacation but will be in Mountain View for a couple of days for a planning meeting. Fun times!

Tuesday, 9 September 2014

rr 2.0 Released

Thanks to the hard work of our contributors, rr 2.0 has been released. It has many improvements over our 1.0 release:

  • gdb's checkpoint, restart and delete checkpoint commands are supported.
    These are implemented using new infrastructure in rr 2.0 for fast cloning of replay sessions.
  • You can now run debuggee functions from gdb during replay.
    This is a big feature for rr, since normally a record-and-replay debugger will only replay what happened during recording --- and of course, function calls from gdb did not happen during recording. So under the hood, rr 2.0 introduces "diversion sessions", which run arbitrary code instead of following a replay. When you run a debuggee function from gdb, we clone the current replay session to a diversion session, run your requested function, then destroy the diversion and resume the replay.
  • Issues involving Haswell have been fixed. rr now runs reliably on Intel CPU families from Westmere to Haswell.
  • Support for running rr in a VM has been improved. Due to a VMWare bug, rr is not as reliable in VMWare guests as in other configurations, but in practice it still works well.
  • Trace compression has been implemented, with compression ratios of 5-40x depending on workload, dramatically reducing rr's storage and I/O usage.
  • Many, many bugs have been fixed to improve reliability and enable rr to handle more diverse workloads.

All the features normally available from gdb now work with rr, making this an important milestone.

The ability to run debuggee functions makes it much easier to use rr to debug Firefox. For example you can dump DOM, frame and layer trees at any point during replay. You can debug Javascript to some extent by calling JS engine helpers such as DumpJSStack(). Some Mozilla developers have successfully used rr to fix real bugs. I use it for most of my Gecko debugging --- the first of my research projects that I've actually wanted to use :-).
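The clone/run/destroy pattern behind diversion sessions can be illustrated with a rough analogy. The sketch below is not rr's implementation; it uses a plain fork() so that a function with side effects runs in a throwaway copy of the process, leaving the original state untouched. The names replay_state, debuggee_function and run_in_diversion are hypothetical and exist only for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the state of a replayed program. */
int replay_state = 42;

/* Hypothetical debuggee function with a side effect on that state. */
int debuggee_function(void) {
    replay_state += 1000;
    return replay_state;
}

/* Run fn in a forked copy of this process (the "diversion"), pass its
 * return value back through a pipe, then discard the copy.
 * Returns 0 on success, -1 on failure. */
int run_in_diversion(int (*fn)(void), int *result) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: free to diverge; its changes never reach the parent. */
        close(fds[0]);
        int r = fn();
        ssize_t n = write(fds[1], &r, sizeof r);
        _exit(n == (ssize_t)sizeof r ? 0 : 1);
    }

    /* Parent: collect the result, then reap (destroy) the diversion. */
    close(fds[1]);
    ssize_t n = read(fds[0], result, sizeof *result);
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (n != (ssize_t)sizeof *result) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling run_in_diversion(debuggee_function, &r) hands back the value computed in the copy, while replay_state in the caller keeps its original value, which is the property that lets the replay resume as if nothing had happened.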

Stephen Kitt has packaged rr for Debian.

Considerable progress has been made towards x86-64 support, but it's not ready yet. We expect x86-64 support to be the next milestone.

I recorded a screencast showing a quick demo of rr on Firefox.

Monday, 8 September 2014

VMWare CPUID Conditional Branch Performance Counter Bug

This post will be uninteresting to almost everyone. I'm putting it out as a matter of record; maybe someone will find it useful.

While getting rr working in VMWare guests, we ran into a tricky little bug. Typical usage of CPUID, e.g. to detect SSE2 support, looks like this pseudocode:

CPUID(0); // get maximum supported CPUID subfunction M
if (S <= M) {
  CPUID(S); // execute subfunction S
}

Thus, CPUID calls often occur in pairs with a conditional branch between them. The bug is that in a VMWare guest, when we count the number of conditional branches executed, the conditional branch between those two CPUIDs is usually (but not always) omitted from the count. We assume this is a VMWare bug because it does not happen on the same hardware outside of a VM, and it does not happen in a KVM-based VM.
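For concreteness, here is a minimal C sketch of that pattern for the SSE2 case. It assumes an x86 target and the <cpuid.h> helper header shipped with GCC and Clang; has_sse2 and SSE2_EDX_BIT are names chosen for this example. The exact instructions emitted depend on the compiler, so treat it as an illustration of the CPUID / branch / CPUID shape rather than a guaranteed reproducer of the bug.

```c
#include <cpuid.h>    /* GCC/Clang CPUID helpers, x86 only */
#include <stdbool.h>
#include <stddef.h>

/* SSE2 support is reported in bit 26 of EDX for CPUID leaf 1. */
#define SSE2_EDX_BIT (1u << 26)

/* Returns true if the CPU reports SSE2 support. */
bool has_sse2(void) {
    unsigned int eax, ebx, ecx, edx;

    /* First CPUID: leaf 0 yields the maximum supported basic leaf M. */
    unsigned int max_leaf = __get_cpuid_max(0, NULL);

    /* The conditional branch that sits between the two CPUIDs. */
    if (max_leaf < 1) {
        return false;
    }

    /* Second CPUID: leaf 1 returns the feature flags. */
    __cpuid(1, eax, ebx, ecx, edx);
    return (edx & SSE2_EDX_BIT) != 0;
}
```

On x86-64 this always returns true, since SSE2 is part of the baseline instruction set there.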

Experiments show that some code sequences trigger the bug and other equivalent sequences don't. Single-stepping and other kinds of interference suppress the bug. My best guess is that VMWare optimizes some forms of the above code, perhaps to reduce the number of VM exits, and in so doing skips execution of the conditional branch, without taking into account that this might perturb performance counter values. Admittedly, it's unusual for software to rely on precise performance counter values the way rr does.

This sucks for rr because rr relies on these counts being accurate. We sometimes find that replay diverges because one of these conditional branches was not counted during recording but is counted during replay. (The other way around is possible too, but less frequently observed.) We have some heuristics and workarounds, but it's difficult to fully work around without adding significant complexity and/or slowdown.
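To make the failure mode concrete, here is a deliberately simplified model, not rr's actual replay logic. It assumes each recorded event carries the cumulative conditional-branch count at which it occurred, and that replay compares its own count at the same event. The function first_divergence and the array layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare cumulative conditional-branch counts observed at each event
 * during recording and during replay. Returns the index of the first
 * event where they differ, or -1 if every count matches. */
long first_divergence(const uint64_t *recorded,
                      const uint64_t *replayed,
                      size_t n_events) {
    for (size_t i = 0; i < n_events; i++) {
        if (recorded[i] != replayed[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

If one branch goes uncounted before the second event during recording, counts such as {10, 24, 39} are recorded while replay sees {10, 25, 40}. Every event from the second onward is then off by one, so the first mismatch is reported at index 1.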

The bug is easily reproduced: just use rr to record and replay anything simple. When replaying, rr automatically detects the presence of the bug and prints a warning on the console:

rr: Warning: You appear to be running in a VMWare guest with a bug
    where a conditional branch instruction between two CPUID instructions
    sometimes fails to be counted by the conditional branch performance
    counter. Partial workarounds have been enabled but replay may diverge.
    Consider running rr not in a VMWare guest.

Steps forward:

  • Find a way to report this bug to VMWare.
  • Linux hosts can run rr in KVM-based VMs or directly on the host. Xen VMs might work too.
  • Parallels apparently supports PMU virtualization now; if Parallels doesn't have this bug, it might be the best way to run rr on a Mac or Windows host.
  • We can add a "careful mode" that would probably almost always replay successfully, albeit with additional overhead.
  • The bug is less likely to show up once rr supports x86-64. At least in Firefox, CPUID instructions are most commonly used to detect the presence of SSE2, which is unnecessary on x86-64.
  • In practice, recording Firefox in VMWare generally works well without hitting this bug, so maybe we don't need to invest a lot in fixing it.