Friday, 16 December 2005

Night Of The Living Threads

Every so often someone encounters a temporary hang in the Firefox UI or an extension and declares that the solution is to use more threads. They are wrong.

The standard threads-and-locks shared-memory programming model that everyone uses today just sucks. Large numbers of possible interleavings make it difficult for programmers to reason about, and therefore code is prone to deadlocks and race conditions ... catastrophic bugs that are often hard to reproduce, diagnose, and fix. Threads-and-locks forces you to violate abstraction and/or design very complicated specifications for how synchronization works across modules. It is very difficult to get good performance; locking schemes don't scale well, and locking has considerable overhead. Maurice Herlihy says it all better than I can.

Many efforts have been made over decades to design better programming models for threads, including one that I was recently part of at IBM. You can make life better by restricting the possible interleavings or providing transactional semantics. These are promising but are not yet available in forms we can use. Even when they become available, they still add complexity over a single-threaded programming model, and fail to solve some important problems. For example, safely killing a running, non-cooperating thread is terribly difficult to get right in every system I know of.

It's important to recognize that threads solve two different problems:

  • Allowing asynchronous execution so one long-running activity does not block another activity
  • Allowing concurrent execution of multiple activities, to take advantage of multiple CPU cores

The problems that people complain about today in Firefox are entirely of the first kind. These problems can be solved without using threads. Here are some specific examples:

  • Loading a big page hangs the browser while we lay out the page. The solution here is to make it possible to interrupt the reflow, process UI and other activity, and then resume layout. This should not be very hard because we support incremental reflow already.
  • The UI hangs while some extension does I/O. The solution here is for the extension to use asynchronous I/O.
  • The UI hangs while we instantiate a plugin. The solution here is to put plugins in their own processes and communicate with them over a channel which we don't block on.

The second problem, taking advantage of multiple CPU cores, is not so easy to get around, because threads are exactly what CPUs provide (today). Therefore we will end up using multiple threads inside Gecko --- carefully, only where it makes sense because we can get significant performance benefits without great complexity; e.g., page drawing should be fairly easy to parallelize. But I will fight tooth and nail to avoid exposing threads to Web/extension developers.


  1. In my opinion, the main problem that needs to be worked around occurs in Thunderbird. With many RSS subscriptions, Thunderbird assumes 100% of the CPU capacity for at least 5 seconds. This is frustrating as it not only prevents access to other areas of Thunderbird (UI etc.) but to other running programs too. This is a 'type 1' problem as you put it. I assume it can be fixed by placing interrupts/resumes every so often. Good post anyway, threads are not always the answer.

  2. Ah, threads. My Gentoo Firefox 1.0.7-r2 on AMD64 crashes regularly (up to 10 times a day!!!) due to what I believe must be a race condition, combined with a reference counting bug. I know where it crashes: the reference count of an event queue drops to zero, destructing the queue, but aborting because it attempts to dispatch the remaining events from the wrong thread.
    But where does it drop the count to 1? It's not in the code where it crashes, I looked very carefully.
    To make things worse, the problem is completely irreproducible on 32-bit builds for some reason. It's also unpredictable: while it always appears on the start of a page load sequence (i.e. I click on a link, and it crashes about when the first server response comes), there are no circumstances I can find that make it happen - no conditions I can test on.
    And since there isn't even a bug or anything filed, I have to assume that the problem only occurs for my own build - and not being an official build, nor talkback-enabled, the chances of getting help are zero.
    So yeah, threading sucks ;)
    Sorry for the rant, I had to do it somewhere.

  3. I've found that the transactional semantics seem to be the easiest abstraction to work with -- composable, dead-lock free. Performance can be an issue, but with some annotations to the compiler, not impossible to solve (and in something like Haskell, no annotations are needed). Async IO just seems like a poor man's lightweight threads -- you want the concurrency, but are using a bad abstraction to achieve it; OS-level threading is clearly too heavyweight on most architectures, even Linux. I agree that right now, using threads in things is a bad idea, even in Java or C#, which supposedly support them at the VM level with linguistic help. The day will come, however, that using abstractions that expose concurrency will be preferrable to trying to hide that explicit concurrency with things like async IO or callbacks.

  4. Trond Danielsen18 December 2005 13:28

    This problem has been addressed by programmers of real-time and reliable systems for more than 20 years. Take a look at the ideas behind occam2 and the transputer which date back to the early 80's, wikipedia will tell you all about it. Even microsoft has resently published some very interesting work that has been done in this field:
    Lets hope that these ideas and methods will finally make their way into the desktop to produce more reliable and faster software.

  5. If you're doing a lot of processing in a JS extension is there a good way to prevent blocking UI (I think you can keep calling setTimeout but that doesn't seem like a very nice solution)?

  6. Have a look at Twisted; or its Javascript counterpart MochiKit.

  7. Threads are certainly the cause of a lot of headaches: that's certain :-). However, I think the reaction to "not use them" is still the wrong one, and I think your arguments and those in the microsoft presententation aren't convincing. First of all; the "cost" of locking or OS threads is actually not high at all; it's pretty much free: We're not developing high-throughput apps; but low-latency apps here. CPU cycles tend not to be expensive, but performing the work at the right time is the issue. When locking costs, and context switches actually are the majority of the *wait* experienced by a user, as opposed to the CPU time spent (which I for the sake of the argument don't care about), then it might be time to look at those costs.
    The presentation gives the interesting example of the double-ended queue. This illustrates their point (and yours I guess) perfectly: a good solution is hard, and basically not worth it for a program like firefox (too likely something will break, bugs are just not worth it). However, the solution they mark as "too conservative", namely the single large lock, actually sounds fine to me. I'ld say, use threading liberally, but with corse grained, safe (as in simple) locking.
    You're basically doing this by using async IO, and there are a whole lot of other cases in which threading is 100% safe as it benefits functions which don't touch any external variables at all (or only simple global variables which remain essentially constant).
    Threading really becomes an issue when you actually have to think about it: if your code actually has to be thread aware, then yes, it might be wise to avoid threads altogether. However, with a little bit of thought it might be possible to use threads usefully for latency reduction without many of the headaches.
    Anyhow, my take on threads ;-)

  8. I found this very interesting but you should make the text on your blog darker. There's not enough contrast against the white background to make reading it confortable. And comment previewing seems to forget the comment content.

  9. If locking is becoming a problem, the solution isn't fewer threads, it's less shared data.
    So I have about 20 bookmarks in a folder, and every day, I open it in tabs. For about 30 seconds after that, even though the first page is fully loaded, scrolling and closing tabs is extremely laggy. So either the tabs don't have their own thread (they ought to), the priority isn't increased for selected tabs (it should be), or there's way to much sharing of data going on.