Friday, 25 May 2018

Intel CPU Bug Affecting rr Watchpoints

I investigated an rr bug report and discovered an annoying Intel CPU bug that affects rr replay using data watchpoints. It doesn't seem to be hit very often in practice, which is good because I don't know any way to work around it. It turns out that the bug is probably covered by an existing Intel erratum for Skylake and Kaby Lake (and probably later generations, but I'm not sure), which I even blogged about previously! However, the erratum does not mention watchpoints and the bug I've found definitely depends on data watchpoints being set.

I was able to write a stand-alone testcase to characterize the bug. The issue seems to be that if a rep stos (and probably rep movs) instruction writes between 1 and 64 bytes (inclusive), and you have a read or write watchpoint in the range [64, 128) bytes from the start of the writes (i.e., not triggered by the instruction), then one spurious retired conditional branch is (usually) counted. The alignment of the writes does not matter, and it's not related to speculative execution.

If you find rr failing during replay with watchpoints set, and the failures go away if you remove the watchpoints, it could well be this bug. Broadwell and earlier don't seem to have the bug.

A possible workaround would be to disable "fast-string optimization" in the kernel at boot time. I don't think there's any way for users to force this to happen in Linux, currently, but someone could write a kernel patch adding a command-line option for that and send it upstream. It would be great if they did!

Fortunately this sort of bug does not affect Pernosco.

Update: Pernosco

In the two years since I left Mozilla I've been working on a new debugger (with Kyle Huey, most of that time). We call it Pernosco — "know thoroughly". Pernosco takes rr recordings of failing runs, analyzes them in the cloud, and provides a Web interface for developers to debug them. It uses components of rr but the debugger implementation is completely different. Pernosco's data-oriented approach has features and performance that rr's approach cannot match — and nor can any other debugger. We've reached the point where external users have used Pernosco to fix real bugs, and their response has been very enthusiastic. It's exciting!

Developers spend a lot of time figuring out bugs. We think Pernosco will create a lot of value for software development organizations. We intend to capture some of that value by offering Pernosco as a paid cloud service. We have many ideas for making debugging easier and more fun, and we need a sustainable business model to make them happen. It's especially important because debugging is an area that has been chronically under-invested in across the industry.

Saturday, 12 May 2018

rr Chaos Mode Improvements

rr's chaos mode introduces nondeterminism while recording application execution, to try to make intermittent bugs more reproducible. I'm always interested in hearing about bugs that cannot be reproduced under chaos mode, especially if those bugs have been diagnosed. If we can figure out why a bug was not reproducible under chaos mode, we can often extend chaos mode to make it reproducible, and this improves chaos mode for everyone. If you encounter such a bug, please file an rr issue about it.

I just landed one such improvement. To trigger a specific Spidermonkey JS engine bug, some thread X had to do a FUTEX_WAKE to wake up thread Y, then immediately yield to let thread Y run for a while without X running any further. rr chaos mode assigns random priorities to threads and strictly adheres to them, so in some runs it would assign X a low priority and Y a high priority and schedule Y whenever both were runnable. However, rr's syscall buffering optimization means the rr supervisor process is not notified after the FUTEX_WAKE and has no opportunity to interrupt X and schedule Y instead, so we keep running the lower-priority X thread, violating our scheduling policy. (Chaos mode randomizes scheduling intervals so it was possible for X to yield at the right point, but very unlikely because the "window of vulnerability" is very small.) The fix is quite easy: in chaos mode, FUTEX_WAKE should not use the syscall buffering optimization. This adds some overhead, but hopefully not all that much, because every FUTEX_WAKE is normally paired with a FUTEX_WAIT (futex-using code should not issue a FUTEX_WAKE if there are no waiters), and a FUTEX_WAIT yields, which is already an expensive operation.

The same sorts of issues exist for other system calls that can make another higher-priority thread runnable, and I've added some slightly more elaborate fixes for those.

One day I should do a proper evaluation of these techniques and publish them...

Wednesday, 9 May 2018

Research Wishlist: A Filesystem For Efficient Host-Guest File Sharing

Suppose you have a guest virtual machine and you'd like to give it read-only access to a set of (large) files created and managed in the host OS (or possibly a different guest VM). Current options for this are inefficient. You can use a network filesystem like 9pfs or NFS to share the files, but that means every read copies the data from host to guest, which means the data exists in two places in system RAM, and throughput is limited. Latency is also bad in practice. Alternatively you can copy the files into a filesystem on the guest, after which access is efficient, but the copy itself is slow especially if the guest won't read all the data, plus of course you have to manage replication and you need twice the persistent storage.

It seems to me a custom-designed host-guest filesystem could share data pages between host and guest, allowing very efficient I/O with no extra copies or memory footprint, as well as fully supporting mmap(), even for writing (i.e., you can have shared-memory communication between applications in different VMs). I imagine the host kernel would mmap() files into guest physical memory and the guest filesystem driver would use those pages as the data pages for its view of the files, similar to the way persistent memory storage devices work. Non-read/write operations such as open, close, directory reading, etc, could be hypercalls from the guest into the host, which I imagine can be quite fast. With this approach, guest reads and writes would have minimal added latency over host reads and writes: for guest-cached pages, none at all, and for guest-cache misses, just the cost of nested page faults.

Even a read-only version of this would be very useful to us. I'm surprised that it doesn't already exist. Maybe it does and it's just not well known? If it doesn't exist, it seems to me this would be an excellent grad-student OS research project. It doesn't sound very difficult. Stanford's Disco project had something like this 20 years ago.

Update Kata Containers (aka Intel Clear Containers) describes something pretty close to what I want, using DAX and QEMU's nvdimm virtualization to map some host files into the guest. But it sounds like you can only map a filesystem image file into the guest, rather than accessing host files on a file by file basis, and then it wouldn't be possible to update files from the host.

Wednesday, 2 May 2018

Priority Is Overrated

We give scientists great credit for priority, i.e. being the first to discover something important — often far too much, because priority is often a matter of happenstance and does not reflect the ability of the discoverer.

Let me use my own work as an example (neatly projecting false humility while being self-promoting). Probably my most influential paper was the PPoPP '03 paper I wrote with Jong-Deok Choi on hybrid dynamic data race detection — 400+ citations so far. (Incidentally that paper was rejected by PLDI, a good reminder that publication reviews are a bit of a lottery.) The key idea was to combine two different existing dynamic race detection methods, lockset and happens-before, applying the expensive happens-before detection to a "synchronizing" subset of memory accesses. AFAIK we were the first to publish that idea, and it has been used by most practical race detectors since. Our paper is the top hit when I search for "hybrid race detection". Does describing it first imply that Jong and I were especially talented? Absolutely not.

For one thing, a very similar approach combining happens-before checks with lock-based analysis was present in the paper "Detecting Access Anomalies In Programs With Critical Sections" by Dinning and Schonberg in 1991 (which we cited). Our work was new and good, but it wasn't unprecedented.

More importantly, the whole field of dynamic data race detection was picking up steam. Both dynamic happens-before and lockset algorithms were prominent and it was only a matter of time before someone combined them. If we hadn't published our work, others would certainly soon have published something similar. We happened to be in the right place at the right time.

One could argue that "being in the right place at the right time" is a very important skill, and that our paper is evidence Jong and I were good at selecting timely problems that would let us write influential papers. That's a useful skill if your goal is to write papers and don't particularly care what they're about. It's less useful if you have a specific problem to solve.

This matters when you need to evaluate the ability of researchers. When making hiring or funding decisions with some specific problem in mind, be careful not to overweight "first to discover X" facts as a predictor of future success. Ditto if you're looking for teaching or advice.

Tuesday, 1 May 2018

rr Trace Portability: x87 "Data Pointer" Broken On Skylake

x87 math instructions have an odd feature where the address of the last memory operand of an x87 instruction is saved in an "FPU Data Pointer" register. This value is stored into memory by an FXSAVE, XSAVE or similar instruction. (For some instructions only the low 32 bits are stored.) On a Broadwell machine this works as expected, but on my Skylake laptop the value is zero even immediately after executing a simple FLD instruction. This program shows the bug:

#include <stdio.h>
#include <stdint.h>
static char buf[10] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
static char fxsave_buf[512];
int main() {
  int ebx;
  asm("cpuid" : "=b"(ebx) : "a"(7), "c"(0));
  printf("cpuid(7,0) ebx=0x%x; FDP_EXCPTN_ONLY=%s\n",
         ebx, (ebx & (1 << 6)) ? "true" : "false");
  asm("fld (%%rbx)\n"
      "fxsave (%%rcx)\n"
      : : "b"(buf), "c"(fxsave_buf));
  uint32_t fdp = *(uint32_t*)(fxsave_buf + 16);
  printf("fdp = 0x%x, buf addr = %p; %s\n",
         fdp, buf,
         fdp == (uint32_t)(uintptr_t)buf ? "OK" : "ERR");
  return 0;

A mysterious barely-documented CPUID feature is related. EAX=7, ECX=0:

EBX Bit 06: FDP_EXCPTN_ONLY. x87 FPU Data Pointer updated only on x87 exceptions if 1.
That bit is 0 on my Skylake machines, but Intel erratum SKL087 explains that Skylake behaves as if this bit was 1. I haven't got newer CPUs to test on but I suspect the FDP_EXCPTN_ONLY bit is 1 on newer CPUs, i.e. Intel just codified the Skylake bug as expected behaviour.

This is a problem for rr trace portability, since when FXSAVE or some XSAVE variant are used (and on x86-64 the glibc dynamic loader always uses one of them), if the FDP is different between recording and replay this difference can leak into stack memory and then into uninitialized stack locations loaded into registers, after which rr may detect a divergence.

I'm not sure what, if anything, can be done about this. x86-64 applications seldom use x87 instructions, but they do get used. For example my Fedora 27 libstdc++ std::__detail::_Prime_rehash_policy::_M_next_bkt(unsigned long) has been compiled to a mix of x87 and SSE instructions.