Friday, 14 September 2007

Parallel Browsing

My mate Ras Bodik has just produced a blog covering his new "parallel browsing" project. It's a great read, especially the slides (which don't display correctly in Mac's Preview, but do in Acrobat Reader). There's some great observations and ideas in there, plus some neat preliminary results about parallel lexing. The bottom line is that future generations of devices will need a web browser that can efficiently utilize tens to hundreds of CPU cores.

This is extremely challenging for many reasons:


  • Existing browser engines are written in C++, which offers only a classic multithreaded parallel programming model: threads, shared memory, locks, semaphores, etc, plus hand-rolled lock-free primitives. These tools are very difficult to use and mostly don't scale very well. There is no direct support for optimistic concurrency.
  • Amdahl's Law means that cherry-picking a few optimization opportunities won't be enough to scale to a large number of cores. We will have to work parallelism through every stage of the browser's processing.
  • It's not at all clear how much of what the browser does can be parallelised. Ras' group's work on parallel lexing is a small first step. Can one write an HTML5 parser in parallel, and if so, how? How about JS/DOM manipulation? Layout? Rendering? The programming model we expose to Web developers is inherently single-threaded and I don't think we should change that.
  • Massive changes to browser code (including rewriting it in a different language) are very risky, and can damage compatibility credibility.
  • Our debugging and profiling tools are already crappy for mostly single-threaded code. Debugging and profiling parallel code will be much harder.
  • Browsers rely on a lot of platform library code. That code also needs to be refitted for scalable parallelism or it will become a bottleneck.

Bottom line: writing a good browser engine is already really hard. Requiring scalable parallelism adds a whole new dimension of engineering difficulty, and poses a bunch of research-level questions which we don't yet have answers for.

I plan to blog about my thoughts on these issues and how they might be solved. But my impression is that it will be very difficult for existing browsers to move into multicore territory, due to language choices, the difficulty of refitting existing code, compatibility with existing code consumers, a desire to not compromise single-core performance, and the enormous risk of a large investment in a long development cycle with no guarantee of success.

I believe there is a huge disruption opportunity here: an effort focused on building a new parallel browser, with no interest in single-core performance, which can rethink all engineering choices, could start by filling a niche and eventually dominate. It could also flame out spectacularly! It would be great territory for a startup, if VCs weren't obsessed with creating the next great social networking site. It's good territory for research --- Ras is on the money. I do hope that if such a disruptive effort wins, the code is open source. I'm not sure what can be done to help ensure that.



19 comments:

  1. My main beef with current browsers is that they lack a super-mega-fast-quick JS engine. I think FireFox 3 is supposed to use Adobe's engine, but I don't know if it's multithreaded, etc.

    ReplyDelete
  2. Robert O'Callahan15 September 2007 01:31

    Firefox 3 won't use Adobe's engine. Some future version of Firefox will. But that doesn't mean Javascript code will automatically just get faster as we add more cores. We could expose some kind of Javascript concurrency to Web developers, but that seems perilous and won't help with the masses of existing code.

    ReplyDelete
  3. Tamarin (Adobe's engine) isn't all that great shakes anyway. It's not like it's an order of magnitude faster than SpiderMonkey -- even in the best case it's something like a factor of 2 (or maybe 3?) better, and for the JS commonly found on the web (not compiled ActionScript, or whatever) my impression was that it is basically a wash at best. I could be wrong in all this, but so far all the outside-looking-in watching I've done of the Tamarin announcements has left me less than impressed.
    For that matter, the JS engine alone isn't necessarily as important as the DOM bindings. I think SpiderMonkey is the fastest JS engine when benchmarked standalone, but in several cases on the web Safari and Opera get wins from lower-overhead DOM bindings. Maybe this would change, though, if people had a "super-mega-fast-quick" engine, in that it'd become more performant to write calculation-heavy JS code if it could save some DOM manipulation? Not sure how that would work though.
    Changing topics... roc, in the abstract, do you think a bigger multicore win would come from somehow parallelizing each chunk of the rendering engine (a parallelized parser, layout, rendering, etc.), or would it be better to basically keep a sort of "single core engine" and simply have many instances of that (say, one per page, or per group of pages, or something)? I'm aware that both methods cause thorny problems for scripts, are hard (if not impossible) to retrofit into Gecko, etc.
    My take on the question is that it probably depends on the usage model. If most people use one or few tabs/windows, then the first route, if possible, would be better, because a single visible page is no better in the second route than today's engines. If people are using lots of tabs, the second route might be easier to write; I'm not sure how I would parallelize some of the pieces of the engine, and if there are still significant bottlenecks, the first route is not going to be much of a win.
    I find the question interesting from my years working on compilers... many multicore designs were touted as being fast when developers could rely on the toolchain to "do the hard work", and I found it amusing that we compiler authors didn't know how to solve the parallelization problems better than anyone else.

    ReplyDelete
  4. I'm not sure if you have seen this before or not. Here is Brendan Eich's thoughts on concurrency in JS. The comments go into more details, but he seemed to be leaning towards building a message passing system into JS3.
    http://weblogs.mozillazine.org/roadmap/archives/2007/02/

    ReplyDelete
  5. But JS is fast enough really - or are we talking JS + SVG + Animation?
    Why fiddle with something that works OK? Where are these real world applications where the JS engine is the bottleneck?
    Lets focus on the SVG animation/timeline code. That is the new web - HTML is old skool. Silverlight + C# + millions of developers and students and high school kids programming C# and silverlight IDE = death of open standards.
    There is a Flash movie on every page of the internet at the moment - why aren't we doing anything about this?
    monk.e.boy

    ReplyDelete
  6. Robert O'Callahan15 September 2007 09:32

    Matt: I've read that, yes.
    Peter: I think Tamarin needs work to cope better with untyped Javascript that you find on the Web, and it's premature to conclude that large speedups cannot be obtained. But anyway, that is a bit off topic...
    Your question about parallelizing within pages vs across pages is interesting, and I'm planning to write about that on my blog. I think that basically people can only pay attention to one page at a time, so parallelizing across pages is not very interesting.
    I snickered when people put their faith in compilers to fix Itanium's performance problems, and I don't intend to commit the same mistake myself. But I think there are improvements in programming languages that could really help without requiring extraordinary effort in the compiler.

    ReplyDelete
  7. Robert O'Callahan15 September 2007 09:42

    monk.e.boy: JS is not fast enough, people are complaining about it already. And there are applications that are not being built because JS is not fast enough to run them. And we are talking about moving to a world where each core is significantly slower than today's desktop/laptop CPUs, so if we're not careful things will actually get *worse*. Benchmarked JS on an iPhone lately? Read Ras' slides.
    "why aren't we doing anything about this?" We are. We're working hard on SVG, cairo, browser-based video and related issues.
    Silverlight is a concern, definitely, but there's no need to panic just yet. It will be long time before Silverlight gets enough client penetration for people to rely on it. Lots of people are learning C#, but lots are also learning Java and Javascript, and dynamic languages (like JS, not C#) are rising in popularity.
    Rest assured, no-one out-paranoids me!

    ReplyDelete
  8. What about simply assigning one processor for each tab, and one for the UI. Even with 2 processors it would be great if the UI would get its own processor, so a number crunching page wouldn't lock the UI.

    ReplyDelete
  9. "I think that basically people can only pay attention to one page at a time, so parallelizing across pages is not very interesting."
    Roc, I dare to disagree, since some apps eat all CPU time even when in the background. That's exactly the user case the separately processed threads would benefit from at most.

    ReplyDelete
  10. The Linux RCU-style locking scheme applied to desktop could be interesting (like they apparantly do in http://osnews.com/comment.php?news_id=18589)

    ReplyDelete
  11. > I think that basically people can only pay attention to one page at a time, so parallelizing across pages is not very interesting.
    I do think it is interesting. Most of the pages I visit will load in a background tab, due to the fact that I mostly middle-click on links. Thus, it happens to me several times a day that the active tab freezes because some page in the background does some heavy JavaScript computation.
    Were tabs rendered concurrently, I could still interact with the active tab (i.e. scroll, click links) without having to wait for pages I can't even see yet.
    On another note: If Erlang can create extremely lightweight "threads" efficiently - why can't be something written with C++ that is able to do the same? (assuming these threads were without shared state, side-effects, etc.)

    ReplyDelete
  12. Robert O'Callahan15 September 2007 23:25

    Diego: that snapshot idea is a good one, it's not the same as RCU however. I had been thinking about lightweight snapshots as one of the things that would be needed to parallelize browsers.
    Sjoerd: just saying "we'll assign one thread to this, and one thread to that" does not solve any of the hard problems. The hard problems are about how the interactions between concurrently executing threads are managed.
    funTomas, Markus: if a background application is consuming tons of CPU time and interfering with my use of a foreground application, then I consider that a bug in the background application.
    It's really a bug in Firefox that the load and layout of a large background page can block interaction with the foreground page. And it's a bug that can be fixed without requiring multiple cores or even multiple threads.
    Markus: the problem is much much bigger than "creating lightweight threads efficiently". It's frustrating that so many people seem to think that Erlang has solved all the problems of concurrency and parallelism. It most certainly has not --- do you think all the researchers working in this area are just stupid? I guess I'll have to write a whole blog entry just to debunk this myth.

    ReplyDelete
  13. > And it's a bug that can be fixed without requiring multiple cores or even multiple threads.
    That's really nice to hear.
    > It's frustrating that so many people seem to think that Erlang has solved all the problems of concurrency and parallelism.
    That's not what I said. Of course Erlang hasn't solved it all. However, large thread creation overhead is certainly part of the problem when it comes to concurrent programming. What I'd like to know is: Why can't C++ do what Erlang can? Does it need some kind of low-level access that can only be provided by the compiler?
    Pardon my ignorance.
    > I guess I'll have to write a whole blog entry just to debunk this myth.
    I'd love to read that one.

    ReplyDelete
  14. Robert O'Callahan16 September 2007 11:39

    Thread creation overhead is really not an issue at all. It is really simple to use thread pools to make thread creation cost a non-issue.
    Reducing per-thread stack usage is hard, and it would require modifications to the C++ compiler to make get heap-allocated stack records or whatever Erlang does, which would hurt performance in some ways. But that's not the hard part of parallelism either.

    ReplyDelete
  15. Thank you for clearing that up. It seems like I have to do some reading on this.

    ReplyDelete
  16. See https://bugzilla.mozilla.org/show_bug.cgi?id=multicore where I write similar words to roc's here: responsive UI does *not* require multiple threads or multiple core. We should fix responsiveness bugs in Firefox. We should not reach for conventional shared-memory threads to solve problems they do not address, when threads have big safety and scaling problems of their own.
    /be

    ReplyDelete
  17. I don't know that the real opportunity here is in a parallel browser, as such, although it is such a vital piece of software that it is important. The real opportunity lies with the tools that one would need to build in order to build such a browser. Just looking over your post, most of the problems you mention are not problems with a parallel browser - they are that the tools don't help you - they get in your way...
    I'm not a computer scientist, but I do spend a lot of time writing performance critical code, and have done a little work on MPI/OpenMPI, and I've been following the discussion around SMP support in the FreeBSD kernel...
    The main tool needed is a good compiler/IDE: one that can take code (especially existing code in C++) and turn it into a syntax tree that can be stored rather than the raw language (with enough 'sugar' that a language can be reconstructed - with the preferred formatting of the user). It should also store the optimization results (including auto-parallelizing the code), including the 'optimized' code, and places where it failed to optimize for some reason. Then it could present you with a GUI IDE where you could look at the code, and then switch over to an 'optimized' view, to see what it did, or to show you why it could not optimize code (for example not being able to determine that pointers/references are non-overlapping, so not being able to do things in parallel, or places where it has to issue cache flushes).
    It also needs a profiler, or something to help you track down the slow parts of your code that are on the critical path and point them out to you, so you can spend time tweaking them. The problem here is that the performance is very dependent on the CPU, memory, kernel, threads, malloc, etc, etc. So the system really needs be fairly integrated with the entire environment, so that it knows that issuing an atomic write will stall all the CPUs and flush their caches (for example), and that the performance penalty from that should not be attached to the code running on the other CPUs but to the code which did the write - this would mean needing to do some sort of 'instruction cost' profiling rather than just time based. Profiling in a VM or a chronicle type profiler might be good for this.
    These wont help you with high level parallelism, but then not much other than being super smart will. But these would help you to get that working - and show you where what looks like a simple instruction actually will stall the entire application, or where you have lots of lock contention.
    I think that there's a bit more time before this all happens than Intel would like us to believe - simply because the tools are not there. Kernels struggle to scale past 8 cores, unless they're doing something special. Most of the real world performance from multi-cores still comes from 'start one process per processor' designs, and that's not much use on the desktop. Lots of desktop time is spent in the GPU, and GPU manufactures are not showing any interest in helping speed 2D rendering (remember when ATI put crystal fonts - an accelerate font renderer - on their cards). Basically, in the desktop market people are going to go into Walmart, play with the mouse a bit, open up some windows, and see that the 2-core machine is the same speed as the 8-core machine, and cheaper. People who care about performance (gamers and rocket scientists - in that order ;-) already test the heck out of their systems, and can see that 4-core machines are bottle necking in all sorts of other places.
    Regards,
    -Jeremy

    ReplyDelete
  18. If I hear about a high amount of parallel tasks, Stackless Python comes to my mind:
    http://www.stackless.com/
    There even is a MMOG already with "hundreds of thousands" of tasklets (EVE Online).
    http://www.tentonhammer.com/node/10044
    And there are of course many more advantages of using a high-level language... ;)

    ReplyDelete
  19. Robert O'Callahan27 September 2007 11:45

    Stackless Python doesn't have anything to do with parallelism.

    ReplyDelete