Friday 14 September 2007
Parallel Browsing
My mate Ras Bodik has just started a blog covering his new "parallel browsing" project. It's a great read, especially the slides (which don't display correctly in Mac's Preview, but do in Acrobat Reader). There are some sharp observations and ideas in there, plus some neat preliminary results about parallel lexing. The bottom line is that future generations of devices will need a web browser that can efficiently utilize tens to hundreds of CPU cores.
This is extremely challenging for many reasons:
- Existing browser engines are written in C++, which offers only the classic multithreaded parallel programming model: threads, shared memory, locks, semaphores, etc., plus hand-rolled lock-free primitives. These tools are very difficult to use and mostly don't scale very well, and there is no direct support for optimistic concurrency. (A toy sketch of this model follows the list.)
- Amdahl's Law means that cherry-picking a few optimization opportunities won't be enough to scale to a large number of cores. We will have to work parallelism into every stage of the browser's processing. (See the back-of-the-envelope numbers after the list.)
- It's not at all clear how much of what the browser does can be parallelised. Ras' group's work on parallel lexing is a small first step. Can an HTML5 parser be parallelised, and if so, how? How about JS/DOM manipulation? Layout? Rendering? The programming model we expose to Web developers is inherently single-threaded and I don't think we should change that.
- Massive changes to browser code (including rewriting it in a different language) are very risky, and can damage compatibility and credibility.
- Our debugging and profiling tools are already crappy for mostly single-threaded code. Debugging and profiling parallel code will be much harder.
- Browsers rely on a lot of platform library code. That code also needs to be refitted for scalable parallelism or it will become a bottleneck.
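To make the first point concrete, here is a toy sketch, assuming nothing about any real browser's code: in the classic C++ shared-memory model, every thread that touches shared state funnels through a lock, so correctness depends entirely on programmer discipline, and adding cores mostly adds contention rather than throughput.

```cpp
// Toy sketch (not browser code) of the classic shared-memory model:
// N worker threads all update one shared structure through a single
// mutex. Correctness depends on every caller remembering to lock, and
// the lock serializes the workers, so throughput barely improves as
// cores are added. The loop count and thread count are arbitrary.
#include <pthread.h>
#include <cstdio>

struct Shared {
    pthread_mutex_t lock;
    long counter;
};

static void *worker(void *arg) {
    Shared *shared = static_cast<Shared *>(arg);
    for (int i = 0; i < 1000000; ++i) {
        pthread_mutex_lock(&shared->lock);   // all threads queue up here
        shared->counter++;                   // stand-in for shared engine state
        pthread_mutex_unlock(&shared->lock);
    }
    return 0;
}

int main() {
    const int kThreads = 8;
    Shared shared;
    pthread_mutex_init(&shared.lock, 0);
    shared.counter = 0;

    pthread_t threads[kThreads];
    for (int i = 0; i < kThreads; ++i)
        pthread_create(&threads[i], 0, worker, &shared);
    for (int i = 0; i < kThreads; ++i)
        pthread_join(threads[i], 0);

    printf("counter = %ld\n", shared.counter);
    pthread_mutex_destroy(&shared.lock);
    return 0;
}
```

Real engines have vastly more shared state than one counter, which is exactly why hand-managing locks across them is so error-prone.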
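And to put rough numbers on the Amdahl's Law point: if a fraction P of the work is parallelized and the rest stays serial, N cores give at most 1 / ((1 - P) + P/N) speedup. The fractions in this sketch are made up for illustration, not measurements of any engine.

```cpp
// Back-of-the-envelope Amdahl's Law numbers. The parallel fractions are
// hypothetical, purely to show how quickly the serial remainder caps the
// achievable speedup.
#include <cstdio>

static double amdahl_speedup(double parallel_fraction, int cores) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

int main() {
    const double fractions[] = {0.50, 0.90, 0.99};  // illustrative only
    const int core_counts[] = {4, 16, 64, 256};

    for (int f = 0; f < 3; ++f) {
        printf("parallel fraction %2.0f%%:", fractions[f] * 100);
        for (int c = 0; c < 4; ++c) {
            printf("  %3d cores -> %5.1fx",
                   core_counts[c], amdahl_speedup(fractions[f], core_counts[c]));
        }
        printf("\n");
    }
    // Even at 90% parallel, 256 cores top out below 10x speedup; to exploit
    // tens to hundreds of cores, parallelism has to reach nearly every
    // stage of the pipeline.
    return 0;
}
```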
Bottom line: writing a good browser engine is already really hard. Requiring scalable parallelism adds a whole new dimension of engineering difficulty, and poses a bunch of research-level questions which we don't yet have answers for.
I plan to blog about my thoughts on these issues and how they might be solved. But my impression is that it will be very difficult for existing browsers to move into multicore territory, due to language choices, the difficulty of refitting existing code, compatibility with existing code consumers, a desire to not compromise single-core performance, and the enormous risk of a large investment in a long development cycle with no guarantee of success.
I believe there is a huge disruption opportunity here: an effort focused on building a new parallel browser, with no interest in single-core performance, which can rethink all engineering choices, could start by filling a niche and eventually dominate. It could also flame out spectacularly! It would be great territory for a startup, if VCs weren't obsessed with creating the next great social networking site. It's good territory for research --- Ras is on the money. I do hope that if such a disruptive effort wins, the code is open source. I'm not sure what can be done to help ensure that.
Comments
For that matter, the JS engine alone isn't necessarily as important as the DOM bindings. I think SpiderMonkey is the fastest JS engine when benchmarked standalone, but in several cases on the web Safari and Opera get wins from lower-overhead DOM bindings. Maybe this would change, though, if people had a "super-mega-fast-quick" engine, in that it'd become more performant to write calculation-heavy JS code if it could save some DOM manipulation? Not sure how that would work though.
Changing topics... roc, in the abstract, do you think a bigger multicore win would come from somehow parallelizing each chunk of the rendering engine (a parallelized parser, layout, rendering, etc.), or would it be better to basically keep a sort of "single core engine" and simply have many instances of that (say, one per page, or per group of pages, or something)? I'm aware that both methods cause thorny problems for scripts, are hard (if not impossible) to retrofit into Gecko, etc.
My take on the question is that it probably depends on the usage model. If most people use one or a few tabs/windows, then the first route, if possible, would be better, because with the second route a single visible page is handled no better than in today's engines. If people are using lots of tabs, the second route might be easier to write; I'm not sure how I would parallelize some of the pieces of the engine, and if significant bottlenecks remain there, the first route is not going to be much of a win.
I find the question interesting from my years working on compilers... many multicore designs were touted as being fast when developers could rely on the toolchain to "do the hard work", and I found it amusing that we compiler authors didn't know how to solve the parallelization problems better than anyone else.
http://weblogs.mozillazine.org/roadmap/archives/2007/02/
Why fiddle with something that works OK? Where are these real world applications where the JS engine is the bottleneck?
Let's focus on the SVG animation/timeline code. That is the new web - HTML is old skool. Silverlight + C# + millions of developers and students and high school kids programming C# in the Silverlight IDE = death of open standards.
There is a Flash movie on every page of the internet at the moment - why aren't we doing anything about this?
monk.e.boy
Peter: I think Tamarin needs work to cope better with untyped Javascript that you find on the Web, and it's premature to conclude that large speedups cannot be obtained. But anyway, that is a bit off topic...
Your question about parallelizing within pages vs across pages is interesting, and I'm planning to write about that on my blog. I think that basically people can only pay attention to one page at a time, so parallelizing across pages is not very interesting.
I snickered when people put their faith in compilers to fix Itanium's performance problems, and I don't intend to commit the same mistake myself. But I think there are improvements in programming languages that could really help without requiring extraordinary effort in the compiler.
"why aren't we doing anything about this?" We are. We're working hard on SVG, cairo, browser-based video and related issues.
Silverlight is a concern, definitely, but there's no need to panic just yet. It will be a long time before Silverlight gets enough client penetration for people to rely on it. Lots of people are learning C#, but lots are also learning Java and Javascript, and dynamic languages (like JS, not C#) are rising in popularity.
Rest assured, no-one out-paranoids me!
Roc, I dare to disagree, since some apps eat all the CPU time even when in the background. That's exactly the use case that would benefit most from separately scheduled threads.
I do think it is interesting. Most of the pages I visit load in a background tab, because I mostly middle-click on links. Thus, it happens to me several times a day that the active tab freezes because some page in the background is doing heavy JavaScript computation.
Were tabs rendered concurrently, I could still interact with the active tab (e.g. scroll, click links) without having to wait for pages I can't even see yet.
On another note: if Erlang can create extremely lightweight "threads" efficiently, why can't something be written in C++ that does the same? (assuming these threads have no shared state, side effects, etc.)
Sjoerd: just saying "we'll assign one thread to this, and one thread to that" does not solve any of the hard problems. The hard problems are about how the interactions between concurrently executing threads are managed.
funTomas, Markus: if a background application is consuming tons of CPU time and interfering with my use of a foreground application, then I consider that a bug in the background application.
It's really a bug in Firefox that the load and layout of a large background page can block interaction with the foreground page. And it's a bug that can be fixed without requiring multiple cores or even multiple threads.
Markus: the problem is much much bigger than "creating lightweight threads efficiently". It's frustrating that so many people seem to think that Erlang has solved all the problems of concurrency and parallelism. It most certainly has not --- do you think all the researchers working in this area are just stupid? I guess I'll have to write a whole blog entry just to debunk this myth.
That's really nice to hear.
> It's frustrating that so many people seem to think that Erlang has solved all the problems of concurrency and parallelism.
That's not what I said. Of course Erlang hasn't solved it all. However, large thread creation overhead is certainly part of the problem when it comes to concurrent programming. What I'd like to know is: Why can't C++ do what Erlang can? Does it need some kind of low-level access that can only be provided by the compiler?
Pardon my ignorance.
> I guess I'll have to write a whole blog entry just to debunk this myth.
I'd love to read that one.
Reducing per-thread stack usage is hard, and it would require modifications to the C++ compiler to get heap-allocated stack records or whatever Erlang does, which would hurt performance in some ways. But that's not the hard part of parallelism either.
/be
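As a minimal illustration of the stack-size constraint Brendan describes, assuming a POSIX threads environment: even asking pthreads for the smallest legal stack still reserves a fixed, contiguous region per thread, which is the thing a compiler would have to change (e.g. by heap-allocating activation records) to get Erlang-style lightweight processes. The thread count below is purely illustrative.

```cpp
// Sketch (not from the original discussion): with plain pthreads you can
// shrink each thread's stack, but every thread still needs a fixed,
// contiguous stack reserved up front. Erlang-style processes, by contrast,
// start with a tiny heap-allocated stack (a few hundred words) that grows
// on demand. The 100,000-task count here is an illustrative assumption.
#include <pthread.h>
#include <limits.h>
#include <cstdio>

static void *tiny_task(void *arg) {
    // Each task does trivial work; the cost we care about is the
    // per-thread stack reservation, not the computation.
    return arg;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // Ask for the smallest stack the platform allows (typically tens of KB).
    pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN);

    const int kTasks = 100000;
    int created = 0;
    for (int i = 0; i < kTasks; ++i) {
        pthread_t t;
        if (pthread_create(&t, &attr, tiny_task, 0) != 0)
            break;  // ran out of address space or kernel thread resources
        pthread_detach(t);
        ++created;
    }
    printf("created %d of %d threads with minimal stacks\n", created, kTasks);
    pthread_attr_destroy(&attr);
    return 0;
}
```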
I'm not a computer scientist, but I do spend a lot of time writing performance-critical code, I've done a little work with MPI/OpenMPI, and I've been following the discussion around SMP support in the FreeBSD kernel...
The main tool needed is a good compiler/IDE: one that can take code (especially existing code in C++) and turn it into a syntax tree that is stored instead of the raw source (with enough 'sugar' that the source can be reconstructed, in the user's preferred formatting). It should also store the results of optimization (including auto-parallelization): the 'optimized' code itself, and the places where it failed to optimize for some reason. Then it could present you with a GUI IDE where you could look at the code and switch over to an 'optimized' view, to see what it did, or why it could not optimize something (for example, not being able to determine that pointers/references are non-overlapping, and so not being able to do things in parallel, or places where it has to issue cache flushes).
It also needs a profiler, or something to help you track down the slow parts of your code that are on the critical path and point them out to you, so you can spend time tweaking them. The problem here is that the performance is very dependent on the CPU, memory, kernel, threads, malloc, etc. So the system really needs to be fairly integrated with the entire environment, so that it knows that issuing an atomic write will stall all the CPUs and flush their caches (for example), and that the performance penalty from that should be attributed not to the code running on the other CPUs but to the code that did the write. This would mean doing some sort of 'instruction cost' profiling rather than just time-based profiling. Profiling in a VM, or a chronicle-type profiler, might be good for this.
These won't help you with high-level parallelism, but then not much other than being super smart will. But they would help you get that working, and show you where what looks like a simple instruction will actually stall the entire application, or where you have lots of lock contention.
I think there's a bit more time before this all happens than Intel would like us to believe, simply because the tools are not there. Kernels struggle to scale past 8 cores unless they're doing something special. Most of the real-world performance from multi-core machines still comes from 'start one process per processor' designs, and that's not much use on the desktop. Lots of desktop time is spent in the GPU, and GPU manufacturers are not showing any interest in helping speed up 2D rendering (remember when ATI put crystal fonts, an accelerated font renderer, on their cards?). Basically, in the desktop market people are going to go into Walmart, play with the mouse a bit, open up some windows, and see that the 2-core machine is the same speed as the 8-core machine, and cheaper. People who care about performance (gamers and rocket scientists - in that order ;-) already test the heck out of their systems, and can see that 4-core machines are bottlenecking in all sorts of other places.
Regards,
-Jeremy
http://www.stackless.com/
There is even an MMOG already with "hundreds of thousands" of tasklets (EVE Online).
http://www.tentonhammer.com/node/10044
And there are of course many more advantages of using a high-level language... ;)