Monday, 7 March 2016

My Self-Identification

The Hacker News discussion about me leaving Mozilla touched on how I identify myself as a Christian at the top of my blog. I do it for a few reasons.

One person said it turns them off because of what it might imply about personal biases. That's close to one reason why I do it: my Christian worldview should pervade everything I do, and I think it's honest to declare those "biases" openly so people can take them into account. When this leads some people to form negative first impressions, I think that's better than letting them proceed with false assumptions.

A pitfall of that approach is that many people's impression of what it means to be a Christian is inaccurate. But that's another reason why I self-identify this way: the more "ordinary Christians" stay in the closet, the more people's impressions of Christians are shaped by the extreme cases the media likes to dig up.

One thing I don't do it for is virtue signalling. For the great majority of people who encounter me through my blog, being a Christian is no virtue.

Some people in the Hacker News thread saw it as proselytising. It really wasn't, just as "repatriate Kiwi, Mozilla hacker" was not encouragement to move to New Zealand and work on Firefox. Having said that, I'm all for proselytising in the right context. You can't be a Christian, believing that you have found the key to having your sins forgiven and being granted a relationship with the God of the universe, and decide to keep it to yourself. Sure, I often find proselytisers of various kinds annoying, but I respect them for following through on the logic of their convictions.

Wednesday, 2 March 2016

Leaving Mozilla

I'm leaving Mozilla. My last day will be this Friday, March 4 --- NZ time.

This isn't the time to go into all the reasons why, but I will mention a couple of big ones. I've been working on the Web platform for sixteen years, full-time for eleven. It was necessary and I'm so proud of what we did, but it was never the most fun thing I could have been doing and I'm very tired of it. And now there's rr. People think it's amazing, and it is, but we're only scratching the surface of what could be done. I'm bursting with ideas, and realizing them will require a lot of resources, resources that Mozilla cannot and should not provide --- we probably need to capture some of the value of this technology via a for-profit company. We can change the way people debug software, and in its own way that may be as important as my Web platform work, and it's work I desperately want to do.

I do not plan to disappear entirely from the Mozilla project. rr is the best research I've ever done partly because we've relentlessly focused on addressing real developer needs, so I want to stay engaged with Mozilla developers, helping them use rr and building tools for them that are a lot more awesome.

This is hard. I love Mozilla and its people so much, and this is a bit of a step into the unknown. I've been singing this a lot lately...

Wednesday, 24 February 2016

"These Bugs Are Impossible To Fix Without rr"

Jan de Mooij and the Mozilla JS team just used rr to figure out a GC-related top-crash in Firefox Nightly. The IRC discussion starts here.

I think rr is great for most kinds of bugs, not just the really hard ones. But obviously hard bugs are where there's the most win to be had.

Tuesday, 23 February 2016

Deeper Into Chaos

A couple of weeks ago we introduced rr chaos mode. Since then quite a few people have tried it out, with a lot of success and some failures. Studying one bug that chaos mode could not reproduce led to some insights that have helped me make some improvements that are now on rr master. The main insight is that while some bugs need a thread to be starved for a long time, other bugs only need a thread to be starved for a short time, but the starvation must happen in a very narrow window of vulnerability. So for these latter bugs, we should introduce very short delays, but we should introduce them very frequently.

Taking a step back, let's assume that for some test we can reproduce a failure if we completely avoid scheduling a certain thread for a period of D seconds, where the start of that period must fall between S and S+T seconds since the start of the test. All these constants are unknown to rr, but we'll assume 1ms <= D <= 2s. Our goal is to come up with a randomized strategy for introducing delays that will reproduce the failure within a reasonable number of runs. Since we only need to reproduce any particular bug once, it would be best to have roughly similar probabilities for reproducing each bug given its unknown parameters (rather than have some bugs very easy to reproduce and other bugs be nearly impossible). I have no idea of the optimal approach here, but here's one that seems reasonable...

First we have to pick the right thread to treat as low priority --- without making many other threads low priority, since they might need to run while our victim thread is being starved. So we give each thread a 0.1 probability of being low priority, except for the main thread, which gets a 0.3 probability, since starving the main thread is often very interesting.

Then we guess a value D' for D. We uniformly choose between 1ms, 2ms, 4ms, 8ms, ..., 1s, 2s. Out of these 12 possibilities, one is between D and 2xD.
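
To make those two choices concrete, here's a minimal Rust sketch. It's my illustration, not rr's actual code (which is C++); it assumes the rand crate's 0.8-style API and numbers threads with the main thread at index 0.

```rust
use rand::Rng; // assumes the `rand` crate (0.8-style API)

/// Decide which threads to treat as low priority: each thread with
/// probability 0.1, except the main thread (index 0) with probability 0.3.
fn pick_low_priority(rng: &mut impl Rng, thread_count: usize) -> Vec<bool> {
    (0..thread_count)
        .map(|i| rng.gen_bool(if i == 0 { 0.3 } else { 0.1 }))
        .collect()
}

/// Guess D': uniformly one of 1ms, 2ms, 4ms, ..., 1s, 2s (12 powers of two).
/// If 1ms <= D <= 2s, exactly one of these lands between D and 2xD.
fn guess_d_prime_ms(rng: &mut impl Rng) -> u64 {
    let exponent: u32 = rng.gen_range(0..12); // 2^0 .. 2^11 milliseconds
    1u64 << exponent
}
```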

We adopt the goal of high-priority-only intervals consuming at most 20% of running time. (If chaos mode slows down a test too much, we might trigger false-positives by inducing timeouts, and avoiding false positives is extremely important.) To maximise the probability of triggering the test failure, we start high-priority-only intervals as often as possible, i.e. one for D' seconds starting every 5xD' seconds. The start time of the first interval is chosen uniformly randomly to be between 0 and 4xD'.

If we guessed D' and the low-priority thread correctly, the probability of triggering the test failure is 1 if T >= 4xD', and T/(4xD') otherwise, i.e. at least T/(8xD). (Higher values of D' than optimal can also trigger failures, but at reduced probabilities since we can schedule them less often.)
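
Continuing the same hypothetical sketch, here's how the schedule of high-priority-only intervals and the hit test could look; the probability comment just restates the estimate above.

```rust
use rand::Rng;

/// One high-priority-only (starvation) interval: [start, start + len).
struct Interval {
    start_ms: f64,
    len_ms: f64,
}

/// Schedule an interval of length D' every 5 * D', so starvation consumes at
/// most 20% of running time. The first interval starts at a uniformly random
/// offset between 0 and 4 * D'.
fn schedule(rng: &mut impl Rng, d_prime_ms: f64, run_len_ms: f64) -> Vec<Interval> {
    let mut intervals = Vec::new();
    let mut start = rng.gen_range(0.0..4.0 * d_prime_ms);
    while start < run_len_ms {
        intervals.push(Interval { start_ms: start, len_ms: d_prime_ms });
        start += 5.0 * d_prime_ms;
    }
    intervals
}

/// The failure reproduces if some interval of length >= D starts inside the
/// window [S, S + T]. With a correct guess (so D <= D' <= 2D), the chance of
/// that is 1 when T >= 4*D', and roughly T/(4*D') otherwise, i.e. >= T/(8*D).
fn hits_window(intervals: &[Interval], s_ms: f64, t_ms: f64, d_ms: f64) -> bool {
    intervals.iter().any(|iv| {
        iv.len_ms >= d_ms && iv.start_ms >= s_ms && iv.start_ms <= s_ms + t_ms
    })
}
```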

I've written some artificial testcases showing that this works. I've also been able to reproduce the motivating bug from the first paragraph. I think it's a clear improvement over the previous approach. No doubt as more bugs come up that we can't reproduce, we can make further improvements.

Rewrite Everything In Rust

I just read Dan Kaminsky's post about the glibc DNS vulnerability and its terrifying implications. Unfortunately it's just one of many, many, many critical software vulnerabilities that have made computer security a joke.

It's no secret that we have the technology to prevent most of these bugs. We have programming languages that practically guarantee important classes of bugs don't happen. The problem is that so much of our software doesn't use these languages. Until recently, there were good excuses for that; "safe" programming languages have generally been unsuitable for systems programming because they don't give you complete control over resources, and they require complex runtime support that doesn't fit in certain contexts (e.g. kernels).

Rust is changing all that. We now have a language with desirable safety properties that offers the control you need for systems programming and does not impose a runtime. Its growing community shows that people enjoy programming in Rust. Servo shows that large, complex Rust applications can perform well.

For the good of the Internet, and in fact humanity, we need to migrate our software from C/C++ to Rust (or something better) as quickly as possible. Here are some steps we need to take:

  • Use Rust. Using Rust instead of C/C++ makes code safer. Using Rust instead of any other language grows the Rust community, making it easier for other people to use Rust.
  • Rewrite code in Rust. Starting with our most critical infrastructure, rewrite C/C++ code to use Rust instead.
  • Extend Rust's guarantees. We can extend the classes of bugs that Rust prevents. For example, we should try to make Rust release builds check for integer overflow by default (there's a small sketch of this after the list).
  • Verify "unsafe" Rust code. Sometimes the Rust type system is not strong enough to let you prove to the compiler that your code is safe, so you have to mark code blocks "unsafe". With tool support, you could instead generate a mathematical proof that your code is safe, checked by the compiler. (See RustBelt.)
  • Make Rust implementations safer. Bugs in the compiler can invalidate the language's guarantees. We know how to build compilers that generate, along with the compiled code, a proof that the code maintains the language guarantees --- a proof that can be checked by a simple, trusted proof checker. Part of this would involve proving properties of the Rust language itself.
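
On the integer-overflow point, here's a small Rust illustration of where things stand today (my example, not from the post): debug builds already panic on overflow, release builds wrap silently unless you opt in via Cargo's overflow-checks profile setting, and checked arithmetic is always available.

```rust
fn main() {
    let x: u8 = 255;

    // Debug builds panic on overflow; release builds currently wrap silently
    // unless you opt in with `overflow-checks = true` under [profile.release]
    // in Cargo.toml. The argument above is that this should be on by default.
    // let y = x + 1; // panics in debug, evaluates to 0 in release

    // Explicitly checked arithmetic works in every build profile:
    match x.checked_add(1) {
        Some(y) => println!("no overflow: {}", y),
        None => println!("overflow detected"),
    }
}
```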

Of course, the language doesn't have to be Rust, but Rust is the best candidate I know of at this time.

This is a huge amount of work, but consider the enormous ongoing costs --- direct and indirect --- of the vulnerabilities that Rust would have prevented.

Wednesday, 10 February 2016

Introducing rr Chaos Mode

Most of my rr talks start with a slide showing some nondeterministic failures in test automation while I explain how hard it is to debug such failures and how record-and-replay can help. But the truth is that until now we haven't really used rr for that, because it has often been difficult to get nondeterministic failures in test automation to show up under rr. So rr's value has mainly been speeding up debugging of failures that were relatively easy to reproduce. I guessed that enhancing rr recording to better reproduce intermittent bugs is one area where a small investment could quickly pay off for Mozilla, so I spent some time working on that over the last couple of months.

Based on my experience fixing nondeterministic Gecko test failures, I hypothesized that our nondeterministic test failures are mainly caused by changes in scheduling. I studied a particular intermittent test failure that I introduced and fixed, where I completely understood the bug but the test had only failed a few times on Android and nowhere else, and thousands of runs under rr could not reproduce the bug. Knowing what the bug was, I was able to show that sleeping for a second at a certain point in the code when called on the right thread (the ImageBridge thread) at the right moment would reproduce the bug reliably on desktop Linux. The tricky part was to come up with a randomized scheduling policy for rr that would produce similar results without prior knowledge of the bug.

I first tried the obvious: allow the lengths of timeslices to vary randomly; give threads random priorities and observe them strictly; reset the random priorities periodically; schedule threads with the same priority in random order. This didn't work, for an interesting reason. To trigger my bug, we have to avoid scheduling the ImageBridge thread while the main thread waits for a 500ms timeout to expire. During that time the ImageBridge thread is the only runnable thread, so any approach that can only influence which runnable thread to run next (e.g. CHESS) will not be able to reproduce this bug.

To cut a long story short, here's an approach that works. Use just two thread priorities, "high" and "low". Make most threads high-priority; I give each thread a 0.1 probability of being low priority. Periodically re-randomize thread priorities. Randomize timeslice lengths. Here's the good part: periodically choose a short random interval, up to a few seconds long, and during that interval do not allow low-priority threads to run at all, even if they're the only runnable threads. Since these intervals can prevent all forward progress (no control of priority inversion), limit their length to no more than 20% of total run time. The intuition is that many of our intermittent test failures depend on CPU starvation (e.g. a machine temporarily hanging), so we're emulating intense starvation of a few "victim" threads, and allowing high-priority threads to wait for timeouts or input from the environment without interruption.
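
Here's a rough Rust sketch of that policy, just to make the moving parts explicit. It's an illustration of the description above, not rr's actual scheduler (which is written in C++); the names, the budget accounting, and the interval bound are invented, and timeslice randomization is omitted.

```rust
use rand::Rng;

#[derive(Clone, Copy, PartialEq)]
enum Priority {
    High,
    Low,
}

/// Illustrative chaos-mode state (not rr's real scheduler).
struct ChaosSched {
    priorities: Vec<Priority>, // one entry per thread
    starved_until_ms: f64,     // end of the current high-priority-only interval
    starvation_budget_ms: f64, // keeps starvation under ~20% of total run time
                               // (replenishment as recorded time advances is omitted here)
}

impl ChaosSched {
    /// Periodically re-randomize priorities: each thread low with probability 0.1.
    fn rerandomize(&mut self, rng: &mut impl Rng) {
        for p in self.priorities.iter_mut() {
            *p = if rng.gen_bool(0.1) { Priority::Low } else { Priority::High };
        }
    }

    /// Maybe start a high-priority-only interval of up to a few seconds,
    /// if there is still starvation budget left.
    fn maybe_start_starvation(&mut self, now_ms: f64, rng: &mut impl Rng) {
        let len = rng.gen_range(0.0..3000.0); // short random interval
        if len <= self.starvation_budget_ms {
            self.starved_until_ms = now_ms + len;
            self.starvation_budget_ms -= len;
        }
    }

    /// During a high-priority-only interval, a low-priority thread may not
    /// run at all, even if it is the only runnable thread.
    fn may_run(&self, thread: usize, now_ms: f64) -> bool {
        self.priorities[thread] == Priority::High || now_ms >= self.starved_until_ms
    }
}
```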

With this approach, rr can reproduce my bug in several runs out of a thousand. I've also been able to reproduce a top intermittent (now being fixed), an intermittent test failure that was assigned to me, and an intermittent shutdown hang in IndexedDB we've been chasing for a while. A couple of other people have been able to reproduce their bugs with it as well. I'm sure there are still bugs this approach can't reproduce, but it's good progress.

I just landed all this work on rr master. The normal scheduler doesn't do this randomization, because it reduces throughput, i.e. slows down recording for easy-to-reproduce bugs. Run rr record -h to enable chaos mode for hard-to-reproduce bugs.

I'm very interested in studying more cases where we figure out a bug that rr chaos mode was not able to reproduce, so I can extend chaos mode to find such bugs.

Friday, 5 February 2016

rr Talk At linux.conf.au

For the last few days I've been attending linux.conf.au, and yesterday I gave a talk about rr. The talk is now online. It was a lot of fun and I got some good questions!