Wednesday, 10 February 2016

Introducing rr Chaos Mode

Most of my rr talks start with a slide showing some nondeterministic failures in test automation while I explain how hard it is to debug such failures and how record-and-replay can help. But the truth is that until now we haven't really used rr for that, because it has often been difficult to get nondeterministic failures in test automation to show up under rr. So rr's value has mainly been speeding up debugging of failures that were relatively easy to reproduce. I guessed that enhancing rr recording to better reproduce intermittent bugs is one area where a small investment could quickly pay off for Mozilla, so I spent some time working on that over the last couple of months.

Based on my experience fixing nondeterministic Gecko test failures, I hypothesized that our nondeterministic test failures are mainly caused by changes in scheduling. I studied a particular intermittent test failure that I introduced and fixed, where I completely understood the bug but the test had only failed a few times on Android and nowhere else, and thousands of runs under rr could not reproduce the bug. Knowing what the bug was, I was able to show that sleeping for a second at a certain point in the code when called on the right thread (the ImageBridge thread) at the right moment would reproduce the bug reliably on desktop Linux. The tricky part was to come up with a randomized scheduling policy for rr that would produce similar results without prior knowledge of the bug.
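The targeted-delay experiment can be sketched roughly like this (the function name and thread-name check are my invention for illustration, not actual Gecko code): when the suspect code path is reached on the right thread, sleep so that the other thread's timeout can expire first.

```cpp
#include <chrono>
#include <string>
#include <thread>

// Hypothetical repro aid (invented names, not Gecko code): when a suspected
// racy code path runs on a particular thread, sleep to widen the race
// window so the other side's timeout fires before this thread proceeds.
void maybe_widen_race(const std::string& current_thread_name,
                      std::chrono::milliseconds delay) {
  if (current_thread_name == "ImageBridge") {
    std::this_thread::sleep_for(delay);
  }
}
```

The point of the exercise: a one-line delay like this reproduced the bug reliably on desktop Linux, but it requires already knowing where the race is, which is exactly what a general-purpose recorder does not know.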

I first tried the obvious: allow the lengths of timeslices to vary randomly; give threads random priorities and observe them strictly (i.e. always run a runnable thread of the highest priority); reset the random priorities periodically; schedule threads with the same priority in random order. This didn't work, for an interesting reason. To trigger my bug, we have to avoid scheduling the ImageBridge thread while the main thread waits for a 500ms timeout to expire. During that time the ImageBridge thread is the only runnable thread, so any approach that can only influence which runnable thread to run next (e.g. CHESS) cannot reproduce this bug.
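As a sketch, that "obvious" randomized policy looks something like this (a minimal model with invented names, not rr's actual code); note that it can only choose among threads that are currently runnable:

```cpp
#include <algorithm>
#include <climits>
#include <random>
#include <vector>

// Minimal model of a naive randomized scheduler (invented names): strict
// random priorities, with random tie-breaking among equal priorities.
struct Thread {
  int id;
  int priority;  // lower number = higher priority
  bool runnable;
};

// Returns the id of the next thread to run, or -1 if nothing is runnable.
// The key limitation: if only one thread is runnable, it always runs,
// no matter how the priorities were randomized.
int pick_next(const std::vector<Thread>& threads, std::mt19937& rng) {
  int best = INT_MAX;
  for (const auto& t : threads)
    if (t.runnable) best = std::min(best, t.priority);
  std::vector<int> candidates;
  for (const auto& t : threads)
    if (t.runnable && t.priority == best) candidates.push_back(t.id);
  if (candidates.empty()) return -1;
  std::uniform_int_distribution<size_t> d(0, candidates.size() - 1);
  return candidates[d(rng)];
}
```

However the priorities come out, a policy shaped like pick_next must return the ImageBridge thread whenever it is the sole runnable thread, which is exactly why this family of approaches fails on the bug above.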

To cut a long story short, here's an approach that works. Use just two thread priorities, "high" and "low". Make most threads high-priority; I give each thread a 0.1 probability of being low priority. Periodically re-randomize thread priorities. Randomize timeslice lengths. Here's the good part: periodically choose a short random interval, up to a few seconds long, and during that interval do not allow low-priority threads to run at all, even if they're the only runnable threads. Since these intervals can block all forward progress (they make no attempt to avoid priority inversion), limit their combined length to no more than 20% of total run time. The intuition is that many of our intermittent test failures depend on CPU starvation (e.g. a machine temporarily hanging), so we're emulating intense starvation of a few "victim" threads, and allowing high-priority threads to wait for timeouts or input from the environment without interruption.
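Putting those rules together, a chaos-mode policy might look roughly like this sketch (all names, the timeslice bounds, and the exact cap bookkeeping are my guesses for illustration; rr's real implementation differs):

```cpp
#include <cstdint>
#include <random>

// Sketch of a chaos-mode scheduling policy (invented names and bounds).
// Threads are either high or low priority.
enum class Priority { High, Low };

struct ChaosScheduler {
  std::mt19937 rng;
  explicit ChaosScheduler(uint32_t seed) : rng(seed) {}

  // Each thread independently gets a 0.1 chance of being low priority;
  // call again periodically to re-randomize.
  Priority randomize_priority() {
    std::bernoulli_distribution low(0.1);
    return low(rng) ? Priority::Low : Priority::High;
  }

  // Randomized timeslice length (bounds invented for illustration).
  uint64_t random_timeslice_us() {
    std::uniform_int_distribution<uint64_t> d(1, 100000);
    return d(rng);
  }

  // Propose a starvation interval, up to a few seconds long, during which
  // low-priority threads may not run at all, even if they're the only
  // runnable threads. Returns 0 (no interval) when granting it would push
  // cumulative starvation time over 20% of the run so far, which bounds
  // the damage from priority inversion.
  uint64_t maybe_start_starvation_us(uint64_t total_run_us,
                                     uint64_t starved_so_far_us) {
    std::uniform_int_distribution<uint64_t> d(0, 3000000);  // up to 3s
    uint64_t len = d(rng);
    if (5 * (starved_so_far_us + len) > total_run_us) return 0;  // 20% cap
    return len;
  }
};
```

The 20% cap is what distinguishes this from the naive policy: low-priority threads do get completely starved, but only for bounded stretches, so the tracee as a whole still makes progress.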

With this approach, rr can reproduce my bug in several runs out of a thousand. I've also been able to reproduce a top intermittent (now being fixed), an intermittent test failure that was assigned to me, and an intermittent shutdown hang in IndexedDB we've been chasing for a while. A couple of other people have also used it to reproduce bugs they were chasing. I'm sure there are still bugs this approach can't reproduce, but it's good progress.

I just landed all this work on rr master. The normal scheduler doesn't do this randomization, because it reduces throughput, i.e. slows down recording for easy-to-reproduce bugs. Run rr record -h to enable chaos mode for hard-to-reproduce bugs.

I'm very interested in studying more cases where we figure out a bug that rr chaos mode was not able to reproduce, so I can extend chaos mode to find such bugs.


  1. Another approach to this particular issue might be to add an idle loop Gecko thread that can always be scheduled.

    1. A couple of problems with that:
      a) doesn't extend to non-Gecko projects
      b) that idle thread would starve out low-priority threads permanently, causing deadlock. It's really important to limit the "complete starvation" intervals to a bounded fraction of run time.

  2. I'm still not sure I fully understand how this was happening on Android. Was it the case that the main thread did (the equivalent of) setTimeout(500) and then did some other stuff that took longer than 500ms on the failing runs, then without rescheduling anything the timer expired and the handler ran, all still on the main thread? Or was it a setTimeout(500) on the main thread followed by scheduling in some other thread that took a while to finish (on the failing Android runs, but was much faster on desktop runs), and then the timeout expired at a time when both the main and ImageBridge threads were runnable?

    You said that only the ImageBridge thread was runnable, which to me says it was the only thread runnable *during passing runs*, but that on failing runs both the main and ImageBridge threads *were* runnable and either the scheduler picked the main thread or there was no scheduling opportunity. Is at least that much correct?

    1. I don't know for sure what was happening on Android, but I think what was happening was "main thread setTimeout(500)", "no Gecko threads are scheduled for > 500ms because Android", "main thread runs timeout callback" and ImageBridge was runnable the entire time.