Wednesday, 10 February 2016

Introducing rr Chaos Mode

Most of my rr talks start with a slide showing some nondeterministic failures in test automation while I explain how hard it is to debug such failures and how record-and-replay can help. But the truth is that until now we haven't really used rr for that, because it has often been difficult to get nondeterministic failures in test automation to show up under rr. So rr's value has mainly been speeding up debugging of failures that were relatively easy to reproduce. I guessed that enhancing rr recording to better reproduce intermittent bugs is one area where a small investment could quickly pay off for Mozilla, so I spent some time working on that over the last couple of months.

Based on my experience fixing nondeterministic Gecko test failures, I hypothesized that our nondeterministic test failures are mainly caused by changes in scheduling. I studied a particular intermittent test failure that I introduced and fixed, where I completely understood the bug but the test had only failed a few times on Android and nowhere else, and thousands of runs under rr could not reproduce the bug. Knowing what the bug was, I was able to show that sleeping for a second at a certain point in the code when called on the right thread (the ImageBridge thread) at the right moment would reproduce the bug reliably on desktop Linux. The tricky part was to come up with a randomized scheduling policy for rr that would produce similar results without prior knowledge of the bug.

I first tried the obvious: allow the lengths of timeslices to vary randomly; give threads random priorities and observe them strictly; reset the random priorities periodically; schedule threads with the same priority in random order. This didn't work, for an interesting reason. To trigger my bug, we have to avoid scheduling the ImageBridge thread while the main thread waits for a 500ms timeout to expire. During that time the ImageBridge thread is the only runnable thread, so any approach that can only influence which runnable thread to run next (e.g. CHESS) will not be able to reproduce this bug.
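
To make that limitation concrete, here is a minimal Python sketch of the first-attempt policy. It is not rr's code: the function name, the "bigger number means higher priority" convention and the thread-name strings are assumptions made for illustration. It only shows that a policy which chooses among runnable threads has no choice to make when exactly one thread is runnable.

```python
import random


def pick_strict_random_priority(runnable, priorities, rng):
    """Run the highest-priority runnable thread, breaking ties randomly.

    Like any policy that only chooses among runnable threads, this never
    leaves the CPU idle while something is runnable.
    """
    if not runnable:
        return None
    best = max(priorities[t] for t in runnable)
    return rng.choice([t for t in runnable if priorities[t] == best])


rng = random.Random(0)
threads = ["main", "imagebridge"]  # illustrative names for the two threads

# While "main" sleeps on its timeout, "imagebridge" is the only runnable
# thread. Whatever random priorities we draw, it gets picked every time.
picks = set()
for _ in range(1000):
    priorities = {t: rng.randrange(10) for t in threads}
    picks.add(pick_strict_random_priority(["imagebridge"], priorities, rng))
```

After the loop, `picks` contains only `"imagebridge"`: no amount of priority randomization delays that thread, which is exactly what the bug needs.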

To cut a long story short, here's an approach that works. Use just two thread priorities, "high" and "low". Make most threads high-priority; I give each thread a 0.1 probability of being low priority. Periodically re-randomize thread priorities. Randomize timeslice lengths. Here's the good part: periodically choose a short random interval, up to a few seconds long, and during that interval do not allow low-priority threads to run at all, even if they're the only runnable threads. Since these intervals can prevent all forward progress (no control of priority inversion), limit their length to no more than 20% of total run time. The intuition is that many of our intermittent test failures depend on CPU starvation (e.g. a machine temporarily hanging), so we're emulating intense starvation of a few "victim" threads, and allowing high-priority threads to wait for timeouts or input from the environment without interruption.
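
The policy above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not rr's actual implementation: the class and method names, the abstract "tick" time unit, the default timeslice bound, and the choice to always prefer high-priority threads over low-priority ones are all mine. Only the 0.1 low-priority probability and the 20% cap come from the description.

```python
import random

LOW_PRIORITY_PROBABILITY = 0.1  # from the description above
STARVATION_BUDGET_DIVISOR = 5   # starvation may use at most 1/5 (20%) of run time


class ChaosScheduler:
    """Toy model of the two-priority policy; time is counted in abstract ticks."""

    def __init__(self, thread_ids, seed=None):
        self.rng = random.Random(seed)
        self.thread_ids = list(thread_ids)
        self.total_ticks = 0           # ticks scheduled so far
        self.starved_ticks = 0         # ticks spent inside starvation intervals
        self.starvation_remaining = 0  # ticks left in the current interval
        self.low_priority = set()
        self.rerandomize_priorities()

    def rerandomize_priorities(self):
        """Call periodically: each thread is low priority with probability 0.1."""
        self.low_priority = {
            t for t in self.thread_ids
            if self.rng.random() < LOW_PRIORITY_PROBABILITY
        }

    def random_timeslice(self, max_ticks=50):
        """Randomized timeslice length (the upper bound is an assumed value)."""
        return self.rng.randint(1, max_ticks)

    def begin_starvation(self, requested_ticks):
        """Start a starvation interval, clamped so starved time stays <= 20%.

        We need starved + n <= (total + n) / 5, i.e. 4n <= total - 5 * starved.
        """
        budget = (self.total_ticks
                  - STARVATION_BUDGET_DIVISOR * self.starved_ticks)
        budget //= STARVATION_BUDGET_DIVISOR - 1
        granted = max(0, min(requested_ticks, budget))
        self.starvation_remaining = granted
        return granted

    def pick(self, runnable):
        """Return the thread to run for one tick, or None to idle the CPU."""
        self.total_ticks += 1
        starving = self.starvation_remaining > 0
        if starving:
            self.starvation_remaining -= 1
            self.starved_ticks += 1
        high = [t for t in runnable if t not in self.low_priority]
        if high:
            return self.rng.choice(high)
        low = [t for t in runnable if t in self.low_priority]
        if low and not starving:
            return self.rng.choice(low)
        # Either nothing is runnable, or only low-priority threads are
        # runnable during a starvation interval: deliberately run nothing.
        return None
```

The key difference from a conventional scheduler is the last branch of `pick`: during a starvation interval it returns `None` even though a low-priority thread is runnable. Using integer arithmetic for the budget avoids floating-point rounding when enforcing the 20% cap.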

With this approach, rr can reproduce my bug in several runs out of a thousand. I've also been able to reproduce a top intermittent (now being fixed) and an intermittent shutdown hang in IndexedDB we've been chasing for a while, and at least one other person has found that chaos mode let them reproduce their bug. I'm sure there are still bugs this approach can't reproduce, but it's good progress.

I just landed all this work on rr master. The normal scheduler doesn't do this randomization, because it reduces throughput, i.e. slows down recording for easy-to-reproduce bugs. Run rr record -h to enable chaos mode for hard-to-reproduce bugs.

I'm very interested in studying more cases where we figure out a bug that rr chaos mode was not able to reproduce, so I can extend chaos mode to find such bugs.

Friday, 5 February 2016

rr Talk At linux.conf.au

For the last few days I've been attending linux.conf.au, and yesterday I gave a talk about rr. The talk is now online. It was a lot of fun and I got some good questions!

Thursday, 4 February 2016

rr 4.1.0 Released

The main change in this release is dramatically improved replay performance, as I documented in November. It took a while to stabilize for release, partly because we ran into a kernel bug that caused rr tests (and sometimes real rr usage) to totally lock up machines. This release contains a workaround for that kernel bug. It also contains support for the gdb find command, and fixes for a number of other bugs.

Tuesday, 2 February 2016

Reflecting On The Lord Of The Rings Movies

I just rewatched the movies for the first time in a long time. I had wondered whether my enthusiasm for them would have worn off, but no; I think they've aged extremely well (unlike some other classics). Watching the extended editions all together in one weekend is a great experience; the ending really pays off because you feel the epicness of the journey.

My favourite movies leave me feeling, not excited or sad or vengeful, but heroically inspired, with a burning desire to go out and do great and noble deeds. Both the book and the movies give me that.

Rewatching the Shire scenes having been to the Hobbiton set is pretty cool.

The only sad part is that it drives home what a missed opportunity the Hobbit movies were.

Rakiura Track

After finishing the Kepler Track, we had a travel/rest day, taking the bus from Te Anau to Invercargill (via Gore) and then flying to Oban on Stewart Island. With a population of 350, Oban is the southernmost town in New Zealand and the base for anything you might want to do on Stewart Island. We were there to spend three days walking the Rakiura Track.

The Kepler is mostly about lakes, valleys and mountains. Stewart Island is all about the ocean and the forest. The Rakiura Track is only a small section of the big tramping tracks on the island; the biggest is the North West Circuit, which apparently takes about ten days at normal speed, though we met a group who were doing it in seven; no small feat given the legendary amount of mud on that track. We had it relatively easy: the second day's leg from Port William hut to North Arm hut was the muddiest I've seen on a "Great Walk", but not much got into my boots. The short section of the North West Circuit I walked from Port William to the other side of the peninsula had mud up to my calves. All Stewart Island tracks are definitely worth bringing hiking boots for.

It would have been odd to do two South Island tramps with no rain, so it was a bit of a relief that we had rain on the second day. Fortunately we made it from Port William to North Arm by 1pm, avoiding the worst of the rain and giving us time to dry our gear by the fire before the hut filled up.

One feature of this tramp was that we made popcorn in the huts. We'd seen it done on the Kepler Track and decided we could do it too, and we were right. Unpopped corn and oil are all you need, and both are easy to carry. A good way to pass a rainy afternoon!

As on most tramps we had great fun talking to other trampers, especially the foreign tourists! We didn't meet many children this time so we didn't get many chances to teach our card games, but we played many hours of Bang! and Citadels amongst ourselves.

Update: David has a lot more (and better) photos from our trip.

Monday, 1 February 2016

Kepler Track

Two weeks ago I spent four days walking the Kepler Track in the South Island with my kids and a friend. It is, as advertised, a "great walk". We had excellent weather, with scarcely a drop of rain --- unusual in Fiordland. There was a lot of cloud, but it worked out well for us: on all days but the second, we were below the clouds, and we spent most of the second day above the clouds, with spectacular views of many mountain-tops rising above a white fluffy ocean. There's a nice variety of terrain --- lakes Te Anau and Manapouri (surrounded by mountains), lots of walking beside streams and rivers, different kinds of forest, and alpine tussock ascending into barren rocky peaks. The Luxmore Cave is also definitely worth a visit, though we were too timid to descend far into it. Bring two lights per person in case one goes out!

Thursday, 14 January 2016

Making Honest Money With The Internet Of Things

I do not like the Internet Of Things.

Nest just demonstrated a great reason why. VW recently showed us what can be done when a device can detect that it's under test. At some point, increasing the smartness of a device is an expected net loss of utility, and for some devices we're clearly crossing that line.

We're not just crossing it, we're surging past it in a classic Silicon Valley feeding frenzy. Expectation that IoT will be the next big thing is a self-fulfilling prophecy, and market pressure favours --- nay, forces --- short-term thinking and deceptive marketing.

The good news is that if you want to cash in on the IoT boom without being part of the problem, there are ways. In ten years, when you rent an apartment or move offices, you'll want to somehow eradicate all the tiny battery-powered IoT devices installed and forgotten about by the previous tenant, which have been compromised and turned into spybots and ransomware. Someone will make a lot of money if they can solve that problem.