Thursday, 27 March 2014

Conflict

I was standing in a long line of people waiting to get their passports checked before departing Auckland for the USA. A man pushed into the line ahead of me. People around me sighed and muttered, but no-one did anything. This triggered my "if no-one else is willing to act, it's up to me" instinct, so I started a conversation:

Roc: Shouldn't you go to the back of the line?
Man: I'm in a hurry to get to my flight.
There are only two flights to the USA in the immediate future and AFAIK neither of them is imminent.
Roc: Which flight are you on?
Man: What's the big deal? I'm hardly going to slow you down.
Roc: But it's rude.
Man: So what do you want me to do?
Roc: I want you to move to the back of the line.
Man: You're going to have to make me.

At that point I couldn't think of anything to say or do so I let it go. It turned out that he was on my flight and it would have been unpleasant if he'd ended up sitting next to me.

I was a bit shocked by this incident, though I probably shouldn't have been. This kind of unapologetic rudeness is not something I encounter very often from strangers in face-to-face situations. I guess that means I'm fortunate. Surprisingly I felt quite calm during the conversation and only felt rage later.

Embarrassingly-much-later, it dawned on me that Jesus wants me to forgive this man. I am now making the effort to do so.

Wednesday, 26 March 2014

Introducing rr

Bugs that reproduce intermittently are hard to debug with traditional techniques because single stepping, setting breakpoints, inspecting program state, etc, is all a waste of time if the program execution you're debugging ends up not even exhibiting the bug. Even when you can reproduce a bug consistently, important information such as the addresses of suspect objects is unpredictable from run to run. Given that software developers like me spend the majority of their time finding and fixing bugs (either in new code or existing code), nondeterminism has a major impact on my productivity.

Many, many people have noticed that if we had a way to reliably record program execution and replay it later, with the ability to debug the replay, we could largely tame the nondeterminism problem. This would also allow us to deliberately introduce nondeterminism so tests can explore more of the possible execution space, without impacting debuggability. Many record and replay systems have been built in pursuit of this vision. (I built one myself.) For various reasons these systems have not seen wide adoption. So, a few years ago we at Mozilla started a project to create a new record-and-replay tool that would overcome the obstacles blocking adoption. We call this tool rr.

Design

Here are some of our key design parameters:

  • We focused on debugging Firefox. Firefox is a complex application, so if rr is useful for debugging Firefox, it is very likely to be generally useful. But, for example, we have not concerned ourselves with record and replay of hostile binaries, or highly parallel applications. Nor have we explored novel techniques just for the sake of novelty.
  • We prioritized deployability. For example, we deliberately avoided modifying the OS kernel and even worked hard to eliminate the need for system configuration changes. Of course we ensured rr works on stock hardware.
  • We placed high importance on low run-time overhead. We want rr to be the tool of first resort for debugging. That means you need to start getting results with rr about as quickly as you would if you were using a traditional debugger. (This is where my previous project Chronomancer fell down.)
  • We tried to take a path that would lead to a quick positive payoff with a small development investment. A large speculative project in this area would fall outside Mozilla's resources and mission.
  • Naturally, the tool has to be free software.

I believe we've achieved those goals with our 1.0 release.

There's a lot to say about how rr works, and I'll probably write some followup blog posts about that. In this post I focus on what rr can do.

rr records and replays a Linux process tree, i.e. an initial process and the threads and subprocesses (indirectly) spawned by that process. The replay is exact in the sense that the contents of memory and registers during replay are identical to their values during recording. rr provides a gdbserver interface allowing gdb to debug the state of replayed processes.
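To make the idea of exact replay concrete, here is a toy sketch in Python. It is not how rr works internally (rr operates on real Linux processes, not Python objects); it only illustrates the principle that if every nondeterministic input is logged during recording and fed back in the same order during replay, the program ends up in an identical state. The names Recorder, Replayer and program are hypothetical.

```python
import random
import time


class Recorder:
    """Runs the program for real and logs every nondeterministic input."""

    def __init__(self):
        self.log = []

    def nondet(self, produce):
        value = produce()       # perform the real, unpredictable operation
        self.log.append(value)  # remember its result for later replay
        return value


class Replayer:
    """Feeds previously logged values back in the order they were recorded."""

    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, produce):
        # Ignore the real operation; return what happened during recording.
        return next(self._values)


def program(env):
    """Stand-in 'program' whose final state depends on nondeterministic inputs."""
    state = []
    for _ in range(5):
        state.append(env.nondet(lambda: random.randint(0, 1000)))
    state.append(env.nondet(time.time))
    return state


recorder = Recorder()
recorded_state = program(recorder)
replayed_state = program(Replayer(recorder.log))

# The replay can be repeated as often as needed and always yields the same state.
assert replayed_state == recorded_state
```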

Performance

Here are performance results for some Firefox testsuite workloads:

These represent the ratio of wall-clock run-time of rr recording and replay over the wall-clock run-time of normal execution (except for Octane, where it's the ratio of the normal execution's Octane score over the record/replay Octane scores ... which, of course, are the same).

It turns out that reftests are slow under rr because Gecko's current default Linux configuration draws with X11, hence the rendered pixmap of every test and reference page has to be slurped back from the X server over the X socket for comparison, and rr has to record all of that data. So I also show overhead for reftests with Xrender disabled, which causes Gecko to draw locally and avoid the problem.
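For clarity, the two overhead ratios described above can be written out as follows. This is only a sketch of the arithmetic; the function names are hypothetical and the numbers in the usage lines are made up for illustration, not measurements.

```python
def time_overhead(rr_seconds, normal_seconds):
    """Wall-clock overhead: run-time under rr divided by normal run-time."""
    return rr_seconds / normal_seconds


def score_overhead(normal_score, rr_score):
    """Octane-style overhead: higher scores are better, so the ratio is inverted."""
    return normal_score / rr_score


# Hypothetical numbers, purely to show how the ratios are formed.
example_time_ratio = time_overhead(150.0, 100.0)
example_score_ratio = score_overhead(20000, 16000)
```

In both cases a ratio of 1.0 would mean no overhead, and larger values mean more slowdown.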

I should also point out that we stopped focusing on rr performance a while ago because we felt it was already good enough, not because we couldn't make it any faster. It can probably be improved further without much work.

Debugging with rr

The rr project landing page has a screencast illustrating the rr debugging experience. rr lets you use gdb to debug during replay. It's difficult to communicate the feeling of debugging with rr, but if you've ever done something wrong during debugging (e.g. stepped too far) and had that "crap! Now I have to start all over again" sinking feeling --- rr does away with that. Everything you already learned about the execution --- e.g. the addresses of the objects that matter --- remains valid when you start the next replay. Your understanding of the execution increases monotonically.

Limitations

rr has limitations. Some are inherent to its design, and some are fixable with more work.

  • rr emulates a single-core machine. So, parallel programs incur the slowdown of running on a single core. This is an inherent feature of the design. Practical low-overhead recording in a multicore setting requires hardware support; we hope that if rr becomes popular, it will motivate hardware vendors to add such support.
  • rr cannot record processes that share memory with processes outside the recording tree. This is an inherent feature of the design. rr automatically disables features such as X11 shared memory for recorded processes to avoid this problem.
  • For the same reason, rr tracees cannot use direct-rendering user-space GL drivers. To debug GL code under rr we'll need to find or create a proxy driver that doesn't share memory with the kernel (something like GLX).
  • rr requires a reasonably modern x86 CPU. It depends on certain performance counter features that are not available in older CPUs, or in ARM at all currently. rr works on Intel Ivy Bridge and Sandy Bridge microarchitectures. It doesn't currently work on Haswell and we're investigating how to fix that.
  • rr currently only supports x86 32-bit processes. (Porting to x86-64 should be straightforward but it's quite a lot of work.)
  • rr needs to explicitly support every system call executed by the recorded processes. It already supports a wide range of syscalls --- syscalls used by Firefox --- but running rr on more workloads will likely uncover more syscalls that need to be supported.
  • When used with gdb, rr does not provide the ability to call program functions from the debugger, nor does it provide hardware data breakpoints. These issues can be fixed with future work.

Conclusions

We believe rr is already a useful tool. I like to use it myself to debug Gecko bugs; in fact, it's the first "research" tool I've ever built that I like to use myself. If you debug Firefox at the C/C++ level on Linux you should probably try it out. We would love to have feedback --- or better still, contributions!

If you try to debug other large applications with rr, you will probably encounter rr bugs. Therefore we are not yet recommending rr for general-purpose C/C++ debugging. However, if rr interests you, please consider trying it out and reporting/fixing any bugs that you find.

We hope rr is a useful tool in itself, but we also see it as just a first step. rr+gdb is not the ultimate debugging experience (even if gdb's reverse-execution features get an rr-based backend, which I hope happens!). We have a lot of ideas for making vast improvements to the debugging experience by building on rr. I hope other people find rr useful as a building block for their ideas too.

I'd like to thank the people who've contributed to rr so far: Albert Noll, Nimrod Partush, Thomas Anderegg and Chris Jones --- and to Mozilla Research, especially Andreas Gal, for supporting this project.

Monday, 24 March 2014

Mozilla And The Silicon Valley Cartel

This article, while overblown (e.g. I don't think "no cold calls" agreements are much of a problem), is an interesting read, especially the court exhibits attached at the end. One interesting fact is that Mozilla does not appear on any of Google's do-not-call lists --- yay us! (I can confirm first-hand that Mozilla people have been relentlessly cold-emailed by Google over the years. They stopped contacting me after I told them not to even bother trying until they open a development office in New Zealand.)

Exhibit 1871 is also very interesting since it appears the trigger that caused Steve Jobs to fly off the handle (which in turn led to the collusion at the heart of the court case) was Ben Goodger referring members of the Safari team for recruitment by Google, at least one of whom was formerly from Mozilla.

Monday, 17 March 2014

Taroko National Park

This weekend, in between the media and layout work weeks, I had time off so I thought I'd get out of Taipei and see a different part of Taiwan. On Alfredo's recommendation I decided to visit Taroko National Park and do some hiking. I stayed at Taroko Lodge and had a great time --- Rihang Su was an excellent host, very friendly and helpful.

On Saturday I took the train from Songshan station in Taipei to Xincheng. From there I went via the lodge to the park's eastern headquarters and did a short afternoon hike up to near Dali village and back down again. This was short in distance --- about 3.5 km each way in two and a half hours --- but reasonably hard work since it's about a 900m vertical climb.

On Saturday night Rihang took the lodge guests to Hualien city for dinner at a hotpot buffet place. The food was reasonable and it was a fun dinner. There were instructions in English and in general I was surprised by how much English signage there was in Hualien (including at the supermarket) --- I had expected none.

On Sunday (today) I got up fairly early for the main goal of my trip: Zhuilu Old Trail. Rihang dropped me off at the western end around 8:15am and I walked the 10km length in a bit under four hours. You're supposed to allow for six, but that's probably a bit conservative, plus I'm fairly fast. It is hard work and tricky in places, however. This trail is quite amazing. There's one stretch in particular where the path winds along the cliff face, mostly about a meter wide, and there's nothing between you and a sheer drop of hundreds of meters --- except a plastic-clad steel cable on the cliff side of the track to hold onto. It reminded me of the National Pass track in Australia's Blue Mountains, but more intense. Much of the track has spectacular views of the Taroko Gorge --- sheer cliffs, marble boulders, lush forest, and mountain streams. In one place I was lucky enough to see macaque monkeys, which was thrilling since I've never seen monkeys in the wild before. The trail has an interesting history; it was originally constructed by the Japanese military in the early 20th century (probably using slave labour, though it wasn't clear from the signage), to improve their control over the native people of the area. Later the Japanese turned it into a tourist destination and that's what it is today. Along the trail there are rest areas where fortified "police stations" used to be.

After finishing that trail, since I was ahead of schedule I continued with the Swallow Grotto walk along the road westward up the gorge. This walk was also amazing but offers quite different views since you're near the bottom of the gorge --- and it's much easier. The walk ends at the bottom of the Zhuilu cliff, which is neck-craningly massive. After that I got picked up and went to the train station to head straight back to Taipei.

This was definitely a very good way to see a bit of Taiwan that's not Taipei. Shuttling around between the train station, the lodge, and hiking destinations, I got a glimpse of what life is like in the region --- gardens, farming, stray dogs, decrepit buildings, industry, narrow roads, etc. But the natural beauty of Taroko gorge was definitely the highlight of the trip. I enjoy hiking with friends and family very much, but it's nice to hike alone for a change; I can go at my own pace, not worry about anyone else, and be a bit more relaxed. All in all it was a great weekend getaway.

Sunday, 16 March 2014

Maokong

I've been in Taiwan for a week. Last Sunday, almost immediately after checking into the hotel I went with a number of Mozilla people to Maokong Gondola and did a short hike around Maokong itself, to Yinhe Cave and back. I really enjoyed this trip. The gondola ride is quite long, and pretty. Maokong itself is devoted to the cultivation and vending of tea. I haven't seen a real tea plantation before, and I also got to see rice paddies up close, so this was quite interesting --- I'm fascinated by agricultural culture, which dominated so many people's lives for such a long time. Yinhe Cave is a cave behind a waterfall which has been fitted out as a Buddhist temple; quite an amazing spot. We capped off the afternoon by having dinner at a tea house in Maokong. A lot of the food was tea-themed --- deep-fried tea leaves, tea-flavoured omelette, etc. It was excellent, a really great destination for visitors to Taipei.

Monday, 10 March 2014

Introducing Chaos Mode

Some test failures are hard to reproduce. This is often because code (either tests or implementation code) makes unwarranted assumptions about the environment, assumptions that are violated nondeterministically. For example, a lot of tests have used setTimeout to schedule test code and assumed certain events will have happened before the timeout, which may not be true depending on effects such as network speeds and system load.

One way to make such bugs easier to reproduce is to intentionally exercise nondeterminism up to the limits of API contracts. For example, we can intentionally vary the actual time at which timers fire, to simulate the skew between CPU execution time and real time. To simulate different permitted thread schedules, we can assign random priorities to threads. Since hashtable iteration is not defined to have any particular order, we can make a hashtable iterator always start at a randomly chosen item.
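As an illustration of the hashtable case, here is a minimal Python sketch of an iterator that visits every entry exactly once but starts at a randomly chosen item and wraps around. It is not the Gecko implementation; chaos_iter is a hypothetical name, and a Python dict stands in for a hashtable whose iteration order callers must not rely on.

```python
import random


def chaos_iter(table, rng=random):
    """Yield every (key, value) pair once, starting at a random entry."""
    items = list(table.items())
    if not items:
        return
    start = rng.randrange(len(items))  # random starting position
    for offset in range(len(items)):
        # Wrap around so that all entries are still visited exactly once.
        yield items[(start + offset) % len(items)]


table = {"a": 1, "b": 2, "c": 3, "d": 4}
visited = list(chaos_iter(table))

# Code that only relies on the contract (every entry is seen once) still works.
assert sorted(visited) == sorted(table.items())
```

Code that silently depends on seeing "a" first will now fail some of the time, which is the point.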

I tried applying this to Gecko. I have patches that define a global "chaos mode" switch, and in several different places, if we're in chaos mode, we choose randomly between different valid behaviors of the code. Here's what the patches currently do:

  • Sometimes yield just before dispatching an XPCOM event. This gives another thread a chance to win an event-dispatch race.
  • On Linux, give threads a random priority and pin some threads to CPU 0 so they contend for CPU.
  • Insert sockets in random positions in the list of polled sockets, to effectively randomize the priority of sockets in poll results.
  • Similarly, when putting HTTP transactions into the HTTP transaction queue, randomly order them among other transactions with the same specified priority.
  • Start hashtable iteration at a random entry.
  • Scale timer firing times by random amounts (but don't vary the order in which timers fire, since that would violate the API contract).
  • Shuffle mochitests and reftests so they run in random order.
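Two of the items above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in the patches: chaos_timer_deadlines and chaos_insert are hypothetical names, timers are modelled as plain numbers, and the HTTP transaction queue is modelled as a list of (priority, item) pairs sorted by priority. The timer sketch assumes that timers may fire late but never early, and it keeps the original firing order by never letting a perturbed deadline fall below an earlier timer's perturbed deadline.

```python
import bisect
import random


def chaos_timer_deadlines(deadlines, rng, max_scale=2.0):
    """Scale each deadline by a random factor while keeping the firing order."""
    order = sorted(range(len(deadlines)), key=lambda i: deadlines[i])
    result = [0.0] * len(deadlines)
    latest = 0.0
    for i in order:
        scaled = deadlines[i] * rng.uniform(1.0, max_scale)  # never earlier
        latest = max(latest, scaled)  # never before a timer that was due sooner
        result[i] = latest
    return result


def chaos_insert(queue, priority, item, rng):
    """Insert at a random position among entries with the same priority."""
    priorities = [p for p, _ in queue]
    lo = bisect.bisect_left(priorities, priority)
    hi = bisect.bisect_right(priorities, priority)
    queue.insert(rng.randint(lo, hi), (priority, item))


rng = random.Random()
perturbed = chaos_timer_deadlines([30.0, 10.0, 20.0], rng)
queue = [(0, "a"), (1, "b"), (1, "c"), (2, "d")]
chaos_insert(queue, 1, "new", rng)
```

In both cases the randomness stays inside what the API contract already permits: timers still fire in order, and transactions are only reordered relative to others of equal priority.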

Note that it can be valuable to make a single random choice consistently (for the same object, thread, etc) rather than making lots of fine-grained random decisions. For example, giving a thread a fixed low priority will starve it of CPU which will likely cause more extreme behavior --- hopefully more buggy behavior --- than choosing a random thread to run in each time quantum.
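One way to get a single consistent choice per object is to derive it from a per-run seed combined with the object's identity, rather than drawing a fresh random number each time. The following Python sketch shows the idea; chaos_priority, the seed value and the priority levels are all hypothetical, and nothing here claims to match what the Gecko patches do.

```python
import random


def chaos_priority(run_seed, thread_name, levels=(0, 5, 10, 19)):
    """Pick one priority per thread; asking again returns the same answer."""
    # Seeding from the run seed plus the thread's name makes the choice
    # random across threads and runs, but stable for a given thread.
    rng = random.Random(f"{run_seed}:{thread_name}")
    return rng.choice(levels)


first = chaos_priority(1234, "Socket Thread")
second = chaos_priority(1234, "Socket Thread")
assert first == second  # the same thread keeps its priority for the whole run
```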

One important source of nondeterminism in Gecko is XPCOM event (i.e. HTML5 task) dispatch. A lot of intermittent bugs are due to event timing and ordering. It would be nice to exploit this in chaos mode, e.g. by choosing the next event to fire randomly from the set of pending events instead of processing them in dispatch order. Unfortunately we can't do that because a lot of code depends on the API contract that firing order follows dispatch order. In general it's hard to determine what the valid alternative firing orders are; the first item on my list above is my best approximation at the moment.

Important Questions

Does this find bugs? Yes:

Which chaos features are the most helpful for producing test failures? I don't know. It would be a very interesting experiment to do try pushes with different patches enabled to figure out which ones are the most important.

Does it help reproduce known intermittent bugs? Sometimes. In bug 975931 there was an intermittent reftest failure I could not reproduce locally without chaos mode, but I could reproduce with chaos mode. On the other hand chaos mode did not help reproduce bug 791480. Extending chaos mode can improve this situation.

Isn't this just fault injection? It's similar to fault injection (e.g. random out-of-memory triggering) but different. With fault injection typically we expect most tests to fail because faults like OOM are not fully recoverable. Chaos mode should not affect any correctness properties of the program.

Wasn't this already done by <insert project name>? Probably. I don't claim this is a new idea.

When is this going to land and how do I turn it on? It has already landed. To turn it on, change isActive() to return true in mfbt/ChaosMode.h. Shuffling of reftests and mochitests has to be done separately.

OK, so this can trigger interesting bugs, but how do we debug them? Indeed, chaos mode makes normal debugging workflows worse by introducing more nondeterminism. We could try to modify chaos mode to reproduce the random number stream between runs but that's inadequate because other sources of nondeterminism would interfere with the order in which the random number stream is sampled. But we are working on a much better solution to debugging nondeterministic programs; I'll be saying more about that very soon!
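The problem with simply reusing the random seed can be shown with a small Python sketch. The consumer names and the schedules are hypothetical; the point is only that when two consumers share one random number stream, a different interleaving hands each consumer different values even though the stream itself is identical.

```python
import random


def run(seed, schedule):
    """Give each consumer the next value from one shared, seeded stream."""
    rng = random.Random(seed)
    seen = {}
    for consumer in schedule:
        seen.setdefault(consumer, []).append(rng.random())
    return seen


# Same seed, but the consumers happen to sample the stream in a different order.
first = run(42, ["timer", "socket", "timer", "socket"])
second = run(42, ["socket", "timer", "timer", "socket"])

# The underlying stream is identical across the two runs ...
same_stream = sorted(sum(first.values(), [])) == sorted(sum(second.values(), []))
# ... yet each consumer sees different values, so behavior diverges.
same_behavior = first == second
```

Here same_stream is true while same_behavior is false, which is why fixing the seed alone does not make a chaos mode run reproducible.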

My Linkedin Account Is Dead, And Why Is Google Being Stupid?

I tried to delete my Linkedin account a while back, but I still get a lot of "invitation to connect on Linkedin" emails. I plan to never connect to anyone on Linkedin ever again, so whoever wants to connect, please don't be offended when it doesn't happen --- it's not about you.

Linkedin, I despise you for your persistent spam.

PS, I'm visiting Taiwan at the moment and wondering why Google uses that as a cue to switch its Web interface to Chinese even when I'm logged into my regular Google account. Dear Google, surely it is not very likely that my change of location to Taiwan indicates I have suddenly learned Chinese and forgotten English.

Friday, 7 March 2014

Fine-Tuning Arguments

I've been doing a little background reading and stumbled over this very interesting summary of arguments for fine-tuning of the universe. The cosmology is a bit over my head but as far as I understand it, it matches other sources I've read. It doesn't reach any conclusions about what causes the fine-tuning but it does pose some interesting challenges to the multiverse explanation.

Wednesday, 5 March 2014

Internet Connectivity As A Geopolitical Tool

A lot of people are wondering what Western countries can do about Russia's invasion of Ukraine. One option I haven't seen anyone suggest is to disconnect Russia from the Internet. There are lots of ways this could be done, but the simplest would be for participating countries to compel their exchanges to drop packets sourced from Russian ISPs.

This tactic has several advantages. It's asymmetric --- hurts Russia a lot more than the rest of the world, because not many services used internationally are based in Russia. It's nonviolent. It can be implemented quickly and cheaply and reversed just as quickly. It's quite severe. It distributes pain over most of the population.

This may not be the right tactic for this situation, but it's a realistic option against most modern countries (other than the USA, for obvious reasons).

If used, it would have the side effect of encouraging people to stop depending on services outside their own country, which I think is no bad thing.

Sunday, 2 March 2014

Te Henga Walkway

On Saturday we finally did the Te Henga Walkway from Bethell's Beach to Goldie's Bush. I've wanted to do it for a long time but it's not a loop so you need multiple cars and a bit of planning. This weekend we finally got our act together with some friends and made it happen.

We dropped a car off at the end of Horseman Rd, at Goldie's Bush, and everyone continued to the start of the track at Bethell's Beach. The track goes over the hills with quite a bit of up-and-down walking but consistently excellent views of the ocean, beaches, cliffs and bush. It's a bit overgrown with gorse in places. After reaching Constable Rd, we walked along a bit and entered the west end of Goldie's Bush, walking through to the carpark on the other side. All the drivers got into our car and we dropped them off at the Bethell's end so they could bring their cars back to Horseman Rd to pick up everyone else.

Oddly, we were a bit slower than the nominal time for the Te Henga Walkway (signs say 3-4 hours, but we took a bit over 4 including our lunch break), but actually considerably faster than the nominal time for Goldie's Bush (signs say 2 hours, we took less than an hour). I would have expected to be slower on the later part of the walk when we were more tired.

Q&A Panel At ACPC This Friday

This Friday evening I'm part of the panel for an open Q&A session at Auckland Chinese Presbyterian Church. It should be a lot of fun!