Monday 8 September 2014
VMWare CPUID Conditional Branch Performance Counter Bug
This post will be uninteresting to almost everyone. I'm putting it out as a matter of record; maybe someone will find it useful.
While getting rr working in VMWare guests, we ran into a tricky little bug. Typical usage of CPUID. e.g. to detect SSE2 support, looks like this pseudocode:
CPUID(0); // get maximum supported CPUID subfunction M
if (S <= M) { 
  CPUID(S); // execute subfunction S
}
Thus, CPUID calls often occur in pairs with a conditional branch between them. The bug is that in a VMWare guest, when we count the number of conditional branches executed, the conditional branch between those two CPUIDs is usually (but not always) omitted from the count. We assume this is a VMWare bug because it does not happen on the same hardware outside of a VM, and it does not happen in a KVM-based VM.
Experiments show that some code sequences trigger the bug and other equivalent sequences don't. Single-stepping and other kinds of interference suppress the bug. My best guess is that VMWare optimizes some forms of the above code, perhaps to reduce the number of VM exits, and in so doing skips execution of the conditional branch, without taking into account that this might perturb performance counter values. Admittedly, it's unusual for software to rely on precise performance counter values the way rr does.
This sucks for rr because rr relies on these counts being accurate. We sometimes find that replay diverges because one of these conditional branches was not counted during recording but is counted during replay. (The other way around is possible too, but less frequently observed.) We have some heuristics and workarounds, but it's difficult to fully work around without adding significant complexity and/or slowdown.
The bug is easily reproduced: just use rr to record and replay anything simple. When replaying, rr automatically detects the presence of the bug and prints a warning on the console:
rr: Warning: You appear to be running in a VMWare guest with a bug
    where a conditional branch instruction between two CPUID instructions
    sometimes fails to be counted by the conditional branch performance
    counter. Partial workarounds have been enabled but replay may diverge.
    Consider running rr not in a VMWare guest.
Steps forward:
- Find a way to report this bug to VMWare.
- Linux hosts can run rr in KVM-based VMs or directly on the host. Xen VMs might work too.
- Parallels apparently supports PMU virtualization now; if Parallels doesn't have this bug, it might be the best way to run rr on a Mac or Windows host.
- We can add a "careful mode" that would probably almost always replay successfully, albeit with additional overhead.
- The bug is less likely to show up once rr supports x86-64. At least in Firefox, CPUID instructions are most commonly used to detect the presence of SSE2, which is unnecessary on x86-64.
- In practice, recording Firefox in VMWare generally works well without hitting this bug, so maybe we don't need to invest a lot in fixing it.
Comments