I think rr is great for most kinds of bugs, not just the really hard ones. But obviously hard bugs are where there's the most win to be had.
Tuesday, 23 February 2016
A couple of weeks ago we introduced rr chaos mode. Since then quite a few people have tried it out, with a lot of success and some failures. Studying one bug that chaos mode could not reproduce led to some insights that have helped me make some improvements that are now on rr master. The main insight is that while some bugs need a thread to be starved for a long time, other bugs only need a thread to be starved for a short time, but the starvation must happen in a very narrow window of vulnerability. So for these latter bugs, we should introduce very short delays, but we should introduce them very frequently.
Taking a step back, let's assume that for some test we can reproduce a failure if we completely avoid scheduling a certain thread for a period of D seconds, where the start of that period must fall between S and S+T seconds since the start of the test. All these constants are unknown to rr, but we'll assume 1ms <= D <= 2s. Our goal is to come up with a randomized strategy for introducing delays that will reproduce the failure within a reasonable number of runs. Since we only need to reproduce any particular bug once, it would be best to have roughly similar probabilities for reproducing each bug given its unknown parameters (rather than have some bugs very easy to reproduce and other bugs be nearly impossible). I have no idea of the optimal approach here, but here's one that seems reasonable...
First we have to pick the right thread to treat as low priority --- without making many other threads low priority, since they might need to run while our victim thread is being starved. So we give each thread a 0.1 probability of being low priority, except for the main thread which we make 0.3, since starving the main thread is often very interesting.
Then we guess a value D' for D. We uniformly choose between 1ms, 2ms, 4ms, 8ms, ..., 1s, 2s. Out of these 12 possibilities, one is between D and 2xD.
We adopt the goal of high-priority-only intervals consuming at most 20% of running time. (If chaos mode slows down a test too much, we might trigger false-positives by inducing timeouts, and avoiding false positives is extremely important.) To maximise the probability of triggering the test failure, we start high-priority-only intervals as often as possible, i.e. one for D' seconds starting every 5xD' seconds. The start time of the first interval is chosen uniformly randomly to be between 0 and 4xD'.
If we guessed D' and the low-priority thread correctly, the probability of triggering the test failure is 1 if T >= 4xD', T/4xD' otherwise, i.e. >= T/8xD. (Higher values of D' than optimal can also trigger failures, but at reduced probabilities since we can schedule them less often.)
I've written some artificial testcases showing that this works. I've also been able to reproduce the motivating bug from the first paragraph. I think it's a clear improvement over the previous approach. No doubt as more bugs come up that we can't reproduce, we can make further improvements.
I just read Dan Kaminsky's post about the glibc DNS vulnerability and its terrifying implications. Unfortunately it's just one of many, many, many critical software vulnerabilities that have made computer security a joke.
It's no secret that we have the technology to prevent most of these bugs. We have programming languages that practically guarantee important classes of bugs don't happen. The problem is that so much of our software doesn't use these languages. Until recently, there were good excuses for that; "safe" programming languages have generally been unsuitable for systems programming because they don't give you complete control over resources, and they require complex runtime support that doesn't fit in certain contexts (e.g. kernels).
Rust is changing all that. We now have a language with desirable safety properties that offers the control you need for systems programming and does not impose a runtime. Its growing community shows that people enjoy programming in Rust. Servo shows that large, complex Rust applications can perform well.
For the good of the Internet, and in fact humanity, we need to migrate our software from C/C++ to Rust (or something better) as quickly as possible. Here are some steps we need to take:
- Use Rust. Using Rust instead of C/C++ makes code safer. Using Rust instead of any other language grows the Rust community, making it easier for other people to use Rust.
- Rewrite code in Rust. Starting with our most critical infrastructure, rewrite C/C++ code to use Rust instead.
- Extend Rust's guarantees. We can extend the classes of bugs that Rust prevents. For example, we should try to make Rust release builds check for integer overflow by default.
- Verify "unsafe" Rust code. Sometimes the Rust type system is not strong enough to let you prove to the compiler that your code is safe, so you have to mark code blocks "unsafe". With tool support, you could instead generate a mathematical proof that your code is safe, checked by the compiler. (See Rustbelt.)
- Make Rust implementations safer. Bugs in the compiler can invalidate the language's guarantees. We know how to build compilers that generate, along with the compiled code, a proof that the code maintains the language guarantees --- a proof that can be checked by a simple, trusted proof checker. Part of this would involve proving properties of the Rust language itself.
Of course, the language doesn't have to be Rust, but Rust is the best candidate I know of at this time.
This is a huge amount of work, but consider the enormous ongoing costs --- direct and indirect --- of the vulnerabilities that Rust would have prevented.
Wednesday, 10 February 2016
Most of my rr talks start with a slide showing some nondeterministic failures in test automation while I explain how hard it is to debug such failures and how record-and-replay can help. But the truth is that until now we haven't really used rr for that, because it has often been difficult to get nondeterministic failures in test automation to show up under rr. So rr's value has mainly been speeding up debugging of failures that were relatively easy to reproduce. I guessed that enhancing rr recording to better reproduce intermittent bugs is one area where a small investment could quickly pay off for Mozilla, so I spent some time working on that over the last couple of months.
Based on my experience fixing nondeterministic Gecko test failures, I hypothesized that our nondeterministic test failures are mainly caused by changes in scheduling. I studied a particular intermittent test failure that I introduced and fixed, where I completely understood the bug but the test had only failed a few times on Android and nowhere else, and thousands of runs under rr could not reproduce the bug. Knowing what the bug was, I was able to show that sleeping for a second at a certain point in the code when called on the right thread (the ImageBridge thread) at the right moment would reproduce the bug reliably on desktop Linux. The tricky part was to come up with a randomized scheduling policy for rr that would produce similar results without prior knowledge of the bug.
I first tried the obvious: allow the lengths of timeslices to vary randomly; give threads random priorities and observe them strictly; reset the random priorities periodically; schedule threads with the same priority in random order. This didn't work, for an interesting reason. To trigger my bug, we have to avoid scheduling the ImageBridge thread while the main thread waits for a 500ms timeout to expire. During that time the ImageBridge thread is the only runnable thread, so any approach that can only influence which runnable thread to run next (e.g. CHESS) will not be able to reproduce this bug.
To cut a long story short, here's an approach that works. Use just two thread priorities, "high" and "low". Make most threads high-priority; I give each thread a 0.1 probability of being low priority. Periodically re-randomize thread priorities. Randomize timeslice lengths. Here's the good part: periodically choose a short random interval, up to a few seconds long, and during that interval do not allow low-priority threads to run at all, even if they're the only runnable threads. Since these intervals can prevent all forward progress (no control of priority inversion), limit their length to no more than 20% of total run time. The intuition is that many of our intermittent test failures depend on CPU starvation (e.g. a machine temporarily hanging), so we're emulating intense starvation of a few "victim" threads, and allowing high-priority threads to wait for timeouts or input from the environment without interruption.
With this approach, rr can reproduce my bug in several runs out of a thousand. I've also been able to reproduce a top intermittent (now being fixed), an intermittent test failure that was assigned to me, and an intermittent shutdown hang in IndexedDB we've been chasing for a while. A couple of other people have found this enabled reproducing their bugs. I'm sure there are still bugs this approach can't reproduce, but it's good progress.
I just landed all this work on rr master. The normal scheduler doesn't do this randomization, because it reduces throughput, i.e. slows down recording for easy-to-reproduce bugs. Run rr record -h to enable chaos mode for hard-to-reproduce bugs.
I'm very interested in studying more cases where we figure out a bug that rr chaos mode was not able to reproduce, so I can extend chaos mode to find such bugs.
Friday, 5 February 2016
Thursday, 4 February 2016
This release mainly improves replay performance dramatically, as I documented in November. It took a while to stabilize for release, partly because we ran into a kernel bug that caused rr tests (and sometimes real rr usage) to totally lock up machines. This release contains a workaround for that kernel bug. It also contains support for the gdb find command, and fixes for a number of other bugs.
Tuesday, 2 February 2016
I just rewatched the movies for the first time in a long time. I had wondered whether my enthusiasm for them would have worn off, but no; I think they've aged extremely well (unlike some other classics). Watching the extended editions all together in one weekend is a great experience; the ending really pays off because you feel the epicness of the journey.
My favourite movies leave me feeling, not excited or sad or vengeful, but heroically inspired, with a burning desire to go out and do great and noble deeds. Both the book and the movies give me that.
Rewatching the Shire scenes having been to the Hobbiton set is pretty cool.
The only sad part is that it drives home what a missed opportunity the Hobbit movies were.
After finishing the Kepler Track, we had travel/rest day taking the bus from Te Anau to Invercargill (via Gore) and then flying to Oban in Stewart Island. Population 350, it's the southernmost town in New Zealand and the base for anything you might want to do in Stewart Island. We were there to spend three days walking the Rakiura Track.
The Kepler is mostly about lakes, valleys and mountains. Stewart Island is all about the ocean and the forest. The Rakiura Track is only a small section of the big tramping tracks on the island; the biggest is the North West Circuit, which apparently takes about ten days at normal speed, though we met a group who were doing it in seven; no small feat given the legendary amount of mud on that track. We had it relatively easy: the second day's leg from Port William hut to North Arm hut was the muddiest I've seen on a "Great Walk", but not much got into my boots. The short section of the North West Circuit I walked from Port William to the other side of the peninsula had mud up to my calves. All Stewart Island tracks are definitely worth bringing hiking boots for.
It would have been odd to do two South Island tramps with no rain, so it was a bit of a relief that we had rain on the second day. Fortunately we made it from Port William to North Arm by 1pm, avoiding the worst of the rain and giving us time to dry our gear by the fire before the hut filled up.
One feature of this tramp was that we made popcorn in the huts. We'd seen it done on the Kepler track and decided we could do it too, and we were right. The unpopped corn and oil are all you need and easy to carry. A good way to pass a rainy afternoon!
As on most tramps we had great fun talking to other trampers, especially the foreign tourists! We didn't meet many children this time so we didn't get many chances to teach our card games, but we played many hours of Bang! and Citadels amongst ourselves.
Update David has a lot more (and better) photos from our trip.
Monday, 1 February 2016
Two weeks ago I spent four days walking the Kepler Track in the South Island with my kids and a friend. It is, as advertised, a "great walk". We had excellent weather, with scarcely a drop of rain --- unusual in Fiordland. There was a lot of cloud, but it worked out well for us: on all days but the second, we were below the clouds, and we spent most of the second day above the clouds, with spectacular views of many mountain-tops rising above a white fluffy ocean. There's a nice variety of terrain --- lakes Te Anau and Manapouri (surrounded by mountains), lots of walking beside streams and rivers, different kinds of forest, and alpine tussock ascending into barren rocky peaks. The Luxmore Cave is also definitely worth a visit, though we were too timid to descend far into it. Bring two lights per person in case one goes out!