Most of my rr talks start with a slide showing some nondeterministic failures in test automation while I explain how hard it is to debug such failures and how record-and-replay can help. But the truth is that until now we haven't really used rr for that, because it has often been difficult to get nondeterministic failures in test automation to show up under rr. So rr's value has mainly been speeding up debugging of failures that were relatively easy to reproduce. I guessed that enhancing rr recording to better reproduce intermittent bugs is one area where a small investment could quickly pay off for Mozilla, so I spent some time working on that over the last couple of months.
Based on my experience fixing nondeterministic Gecko test failures, I hypothesized that our nondeterministic test failures are mainly caused by changes in scheduling. I studied a particular intermittent test failure that I introduced and fixed, where I completely understood the bug but the test had only failed a few times on Android and nowhere else, and thousands of runs under rr could not reproduce the bug. Knowing what the bug was, I was able to show that sleeping for a second at a certain point in the code when called on the right thread (the ImageBridge thread) at the right moment would reproduce the bug reliably on desktop Linux. The tricky part was to come up with a randomized scheduling policy for rr that would produce similar results without prior knowledge of the bug.
I first tried the obvious: allow the lengths of timeslices to vary randomly; give threads random priorities and observe them strictly; reset the random priorities periodically; schedule threads with the same priority in random order. This didn't work, for an interesting reason. To trigger my bug, we have to avoid scheduling the ImageBridge thread while the main thread waits for a 500ms timeout to expire. During that time the ImageBridge thread is the only runnable thread, so any approach that can only influence which runnable thread to run next (e.g. CHESS) will not be able to reproduce this bug.
To cut a long story short, here's an approach that works. Use just two thread priorities, "high" and "low". Make most threads high-priority; I give each thread a 0.1 probability of being low priority. Periodically re-randomize thread priorities. Randomize timeslice lengths. Here's the good part: periodically choose a short random interval, up to a few seconds long, and during that interval do not allow low-priority threads to run at all, even if they're the only runnable threads. Since these intervals can prevent all forward progress (no control of priority inversion), limit their length to no more than 20% of total run time. The intuition is that many of our intermittent test failures depend on CPU starvation (e.g. a machine temporarily hanging), so we're emulating intense starvation of a few "victim" threads, and allowing high-priority threads to wait for timeouts or input from the environment without interruption.
With this approach, rr can reproduce my bug in several runs out of a thousand. I've also been able to reproduce a top intermittent (now being fixed), an intermittent shutdown hang in IndexedDB we've been chasing for a while, and at least one other person has found this enabled reproducing their bug. I'm sure there are still bugs this approach can't reproduce, but it's good progress.
I just landed all this work on rr master. The normal scheduler doesn't do this randomization, because it reduces throughput, i.e. slows down recording for easy-to-reproduce bugs. Run rr record -h to enable chaos mode for hard-to-reproduce bugs.
I'm very interested in studying more cases where we figure out a bug that rr chaos mode was not able to reproduce, so I can extend chaos mode to find such bugs.