Monday 14 November 2016
Handling Hardware Lock Elision In rr
Intel's Hardware Lock Elision feature lets you annotate instructions with prefixes to indicate that they perform lock/unlock operations. The CPU then turns those into hardware memory transactions so that the instructions in the locked region are performed speculatively and only committed at the unlock. The difference between HLE and the more capable RTM transactional memory support is that HLE is supposed to be fully transparent. The prefixes are ignored on non-HLE-supporting CPUs so you can just add them to your code and things will hopefully get faster --- no CPUID checks are necessary. Unfortunately, by default, Intel's hardware performance counters count events in aborted transactions, even though they didn't really happen in terms of user-space effects. Thus when rr records a program that uses HLE, our conditional branch counter may report a value higher than the number of conditional branches that "really" executed, and this breaks rr. (FWIW we discovered this problem when Emilio was using rr to debug intermittent failures in Servo using the latest version of the Rust parking_lot crate.)
For RTM we have some short-term hacks to disable RTM usage in glibc, and the medium-term solution is to use "CPUID faulting" to trap CPUID and modify the feature bits to pretend RTM is not supported. This approach doesn't work for HLE because there is no need to check CPUID before using it.
Fortunately Intel provides an IN_TXCP flag that you can set on a hardware performance counter to indicate that it should not count events in aborted transactions. This is exactly what we need. However, for replay we need to be able to program the PMU to send an interrupt after a certain number of events have occurred, and the Linux kernel prevents us from doing that for IN_TXCP counters. Apparently that's because if you request an interrupt after a small number of events and then execute an HLE transaction that generates more than that number of events, the CPU will detect the overflow, abort the transaction, roll the counter back to its pre-transaction value, then the kernel notices there wasn't really an overflow, restarts the transaction, and you're in an infinite loop.
The solution to our dilemma is to use two counters to count conditional branches. One counter is used to generate interrupts, and it is allowed to count events in aborted transactions. Another counter uses IN_TXCP to avoid counting events in aborted transactions, and we use this counter only for measurement, never for generating interrupts. This setup works well. It means that during replay our interrupt might fire early, because the interrupt counter counted events in aborted transactions, but that's OK because we already have a mechanism to carefully step forward to the correct stopping point.
There is one more wrinkle. While testing this new approach I noticed that there are some cases where the IN_TXCP counter reports spurious events. This is obviously a nasty little bug in the hardware, or possibly the kernel. On my system you can reproduce it just by running perf stat -e r5101c4 -e r2005101c4 ls --- the second event is just the IN_TXCP version of the first event (retired conditional branches), so should always report counts less than or equal to the first event, but I get results like
Performance counter stats for 'ls': 1,994,374 r5101c4 1,994,382 r2005101c4I have a much simpler testcase than ls which I'll try to get someone at Intel to look at. For now, we're working around it in rr by using the results of the regular counter when the IN_TXCP counter's value is larger. This should work as long as an IN_TXCP overcount doesn't occur in an execution sequence that also uses HLE, and both of those are hopefully rare.