Wednesday, 26 September 2007

Parallel DOM Access

Suppose we want to exploit parallelism to load and run an existing Web application faster. How might that be done?

Here's a very high level diagram of the data flow inside the browser. I believe it's reasonably accurate for Webkit as well as Gecko.

There are some missing edges in some cases; for example, the style system can affect the DOM via XBL. I'm going to gloss over those for now.

Observe that the DOM is really the center of the universe. It's a big blob of mutable state that everything else depends on. Any parallelization of single-page applications is going to have to carefully manage access to the DOM.

For performance reasons, changes to the DOM are lazily propagated to the style system, layout and rendering. The goal is to batch changes so that these phases are not reexecuted too often --- they're expensive, even when incremental. DOM updates make method calls to registered listeners. Listeners record which styles might have been affected and post a restyle event (if one hasn't already been queued) to be processed later. Similarly, they set dirty bits in the frame tree and then post a reflow event. On the other side, the parser batches its appends to the DOM as well. It's primarily only script execution that makes many fine-grained, interleaved reads and writes.

[There are times when scripts need access to up-to-date style and layout data. In Gecko, at those times we invoke FlushPendingNotifications to force synchronous restyling and layout. Because this is already a significant performance hit, Web developers should already be trying to avoid triggering it (e.g., by accessing DOM offset* properties).]

I think we could leverage these access patterns to help parallelize browser execution:

  • Parsing can be mostly asynchronous; it should only need to synchronize with the DOM when it's ready to append a bunch of elements.
  • Styling, layout and rendering can take a snapshot of the DOM and use it while the real DOM is being updated by other tasks. It's OK for the display of a page to lag slightly behind; it already does today. The snapshot will be refreshed each time we need to update the window again, or when layout results must be return synchronously to scripts.
  • To run scripts, such as JS event handlers, in parallel, we could use optimistic concurrency via transactions. The idea is that we run a script speculatively against a snapshot of the DOM, recording which parts of the DOM it reads and suppressing its changes to the state of the DOM and other parts of the world. When the script has completed, we synchronize with the real DOM and check whether anything the script depended on has changed. If not, then we can commit the script's global changes. Otherwise, its execution is invalid; we throw out its changes and reexecute it all over again.
  • Some DOM-accessing code cannot be executed speculatively, such as plugins, or Javascript that takes actions that cannot be rolled back, such as sending a message to a server. (At CMU, the canonical example of an irrevocable message was "fire missile!") Therefore at any point in time there can be (at most) one thread accessing the "real DOM" which is not speculative and will never be asked to abort.

This model would be particularly useful for Firefox where extensions written in Javascript have access to the DOMs of all browsed pages. We could allow extensions to execute in parallel with client code as long as at least one of them is speculative.

None of this is easy to implement. Making real snapshots of the DOM would be prohibitively expensive, so some kind of versioning or copy-on-write scheme is required. Which state constitutes "the DOM" that needs versioning is actually a little nebulous; for example, would we version the state of <canvas> elements? How about DOM Storage or the proposed DOM SQL API? However, the good news is that some of this can be done incrementally. For example, we could start with a limited slice of DOM state that is versioned, and decree that speculative script accessing any other state is aborted, and then gradually increase the scope of versioning.

There's a lot more to think and talk about in this area.

Tuesday, 25 September 2007


Some of the last major pieces of text work for Gecko 1.9/Firefox 3 are in place. Several days ago I landed a patch that sets text-rendering:optimizeLegibility for text inputs and XUL root elements (by default, it will inherit to the descendants of XUL root elements too). This means we automatically get high-quality mode (ligatures, kerning etc) for text in XUL documents (e.g., XUL application UI) and text inputs.

Today I landed another patch to compute accurate glyph bounds for all text in high-quality mode. This means dynamic changes to such text should redraw correctly even for fonts with glyphs that extend outrageously far outside the font ascent, descent, etc. Here's a demo ... this textarea requests Zapfino and DejaVu Sans; so it will work best on Macs, which have Zapfino preinstalled, but there should also be an "fi" ligature on Linux systems that have a recent version of DejaVu Sans installed. (I'm not sure what fonts to request for Windows, please advise!) Try selecting and editing around a ligature. Of course you will need a Firefox trunk build as of a few hours ago. If you have Zapfino, for a good time try entering "Zapfino" :-). It might not work in feeds or other contexts that sanitize content.

Ideally, we would enable accurate glyph bounds for all text, not just high-quality text, because without those bounds there can be repaint bugs even in "normal" text (e.g., the tail of italic j in many fonts extends beyond the left edge of the character cell). What we're doing here is a compromise so that we don't eat the performance cost of getting glyph extents for all text. (See my Glyph Bounds Problem post for more details.) However, at some point we should do some measurements to see exactly what the performance cost of turning it on always would be.

Similarly, we should also measure the performance cost of turning on high-quality text rendering always.

Update Added Segoe Script to the textarea for Vista users.

Monday, 24 September 2007

My World Cup Predictions

Based on the All Blacks' games so far:

  1. All Blacks lose to France in the quarter-final; a high error rate snuffs out promising attacks and France score intercept tries.
  2. The international press, who have been praising the All Blacks to the skies, explain how over-rated they always were.
  3. The rest of the rugby world learns that the way to ensure the ABs do not win the World Cup is to never, ever play your A team against them except in the knockout stages of the WC.
  4. The press gloats about how the ABs never win the WC, which is the only contest that matters. At the same time, the press complains that no-one takes non-WC games seriously anymore.
  5. The press complains that conservative, keep-it-tight rugby is boring. At the same time, the press ridicules the ABs for playing a wide game, which everyone knows is not the way to win the World Cup (the only contest that matters, see above).

Monday, 17 September 2007

Why Erlang Is Not The (Whole) Answer

When discussing the problems of concurrency and parallelism, remarkably often someone pipes up with "You should just use Erlang!" This is frustrating.

I think by "Erlang" they mainly mean no-shared-memory, message passing processes. That is definitely a good way to structure many systems --- it avoids the pit of all evil, concurrency with unrestricted shared memory. But for many problems it is not a good fit. Many problems are fundamentally about concurrent updates to shared state, for example:

  1. Transfers between bank accounts
  2. Object interactions in a virtual world
  3. Scripts manipulating the DOM in a parallel browser

The right abstraction for these problems is atomic updates, a.k.a. transactions. In each transaction we want to read multiple values, do some computation, and update multiple values. We want the program behaviour to be as if transactions happen one at a time (atomicity), but we actually want updates to execute in parallel when they do not interfere.

This does not map well to pure message-passing systems in general. You can implement ad-hoc solutions to these transactional problems in most languages, including Erlang. You can even implement some sort of transactional memory in Erlang. But the language --- message passing in particular --- isn't helping you here. To cleanly express solutions to these problems we need a language (or if we're lucky, a library) which can directly describe shared state and atomicity, and implement them with some degree of parallelism.

Transactions are no panacea either! The point is just that message passing is often not the best way to express concurrency, and therefore the ideal system for parallel programming will contain more concurrency features than just message passing.

Friday, 14 September 2007

Parallel Browsing

My mate Ras Bodik has just produced a blog covering his new "parallel browsing" project. It's a great read, especially the slides (which don't display correctly in Mac's Preview, but do in Acrobat Reader). There's some great observations and ideas in there, plus some neat preliminary results about parallel lexing. The bottom line is that future generations of devices will need a web browser that can efficiently utilize tens to hundreds of CPU cores.

This is extremely challenging for many reasons:

  • Existing browser engines are written in C++, which offers only a classic multithreaded parallel programming model: threads, shared memory, locks, semaphores, etc, plus hand-rolled lock-free primitives. These tools are very difficult to use and mostly don't scale very well. There is no direct support for optimistic concurrency.
  • Amdahl's Law means that cherry-picking a few optimization opportunities won't be enough to scale to a large number of cores. We will have to work parallelism through every stage of the browser's processing.
  • It's not at all clear how much of what the browser does can be parallelised. Ras' group's work on parallel lexing is a small first step. Can one write an HTML5 parser in parallel, and if so, how? How about JS/DOM manipulation? Layout? Rendering? The programming model we expose to Web developers is inherently single-threaded and I don't think we should change that.
  • Massive changes to browser code (including rewriting it in a different language) are very risky, and can damage compatibility credibility.
  • Our debugging and profiling tools are already crappy for mostly single-threaded code. Debugging and profiling parallel code will be much harder.
  • Browsers rely on a lot of platform library code. That code also needs to be refitted for scalable parallelism or it will become a bottleneck.

Bottom line: writing a good browser engine is already really hard. Requiring scalable parallelism adds a whole new dimension of engineering difficulty, and poses a bunch of research-level questions which we don't yet have answers for.

I plan to blog about my thoughts on these issues and how they might be solved. But my impression is that it will be very difficult for existing browsers to move into multicore territory, due to language choices, the difficulty of refitting existing code, compatibility with existing code consumers, a desire to not compromise single-core performance, and the enormous risk of a large investment in a long development cycle with no guarantee of success.

I believe there is a huge disruption opportunity here: an effort focused on building a new parallel browser, with no interest in single-core performance, which can rethink all engineering choices, could start by filling a niche and eventually dominate. It could also flame out spectacularly! It would be great territory for a startup, if VCs weren't obsessed with creating the next great social networking site. It's good territory for research --- Ras is on the money. I do hope that if such a disruptive effort wins, the code is open source. I'm not sure what can be done to help ensure that.

Wednesday, 12 September 2007

Trust No One

When is an ASCII space (0x20) not a word separator?

When it's followed by a combining mark (e.g., COMBINING ACUTE ACCENT a.k.a. Unicode character 0x301).

According to ATSUI, anyway. Uniscribe disagrees and refuses to combine marks with space characters. It will allow combination if you stick a ZWJ (0x200D) in between. Gah!

We've also discovered that ATSUI's font fallback machinery often likes to choose different fonts for the mark and the character it combines with. Madness!

This is life working on Web browsers: the environment is so complex, any assumptions you make will be violated sooner or later.

Tuesday, 11 September 2007

Federal Government

Work on garbage collection tends to assume that all objects in the heap are being managed by a single collector. In practice this is a problem because people want to have code written in different languages sharing data. We face this problem internally because we have C++ and Javascript code sharing objects. We also face it externally because people want to be able to use Python and other languages to write XUL applications or even potentially on the Web.

One approach is to implement all those languages on top of the same collector, which often means implementing them on the same virtual machine. That can be OK but it can impose some undesirable tradeoffs when the constraints of the virtual machine or collector don't allow the language to be implemented well. You often end up with a language that has some parts filed off because they don't fit ... e.g. IronRuby can't implement first-class continuations. Or you take a memory footprint hit because the object representation imposed by the runtime isn't optimal for your favourite language. Or maybe the engineering effort required to reimplement the language is prohibitive.

Another approach is to use reference counting at the language boundaries, which is what we mainly rely on. This is nice because it can be implemented in "wrapper objects" without touching the internals of the language implementation. But this approach is unable to collect cycles of objects that span language boundaries.

If we want to collect cross-language cycles it seems inevitable that we need a way to trace through all object graphs. That means we have to modify, or at least have knowledge of, the guts of the language runtimes involved. Given that requirement, what is the minimally invasive strategy that can collect cycles across language boundaries? My hypothesis: a distributed mark and sweep global collector.

The idea is that periodically there is a "global garbage collection", which works like this:

  1. "Stop the world": All language runtimes advance all their active threads to GC safe points. In Spidermonkey, for example, this means reaching the end of a JS "request".
  2. "Mark": Request all language runtimes mark all object in their heaps from their roots.

    • While marking, a runtime may encounter a reference to an object managed by another runtime. Such references are dispatched to be marked by that runtime.

  3. "Sweep": Request all language runtimes sweep their heaps and discard unmarked objects.
  4. "Restart": Tell all runtimes to restart their threads.

Extra steps would be required to support finalization and weak references, but it would be similar to the way they work in regular mark and sweep.

A list of active language runtimes must be maintained. Cross-runtime mark requests requires a map from memory areas to runtimes. This map could be gathered from all the runtimes at the start of each global GC. This map would also help conservative collectors detect references.

This approach is really very simple. It does not constrain the representation of a runtime's objects in any way other than that they must be addressable via pointers. It does not impose any read or write barriers. It does not constrain synchronization or threading models, other than requiring runtimes to have some way to stop their worlds. It allows runtimes to use both exact and conservative roots and tracing internally.

It also allows considerable flexibility for the runtime's internal garbage collection strategy. Obviously mark-and-sweep algorithms fit right in. Most generational collectors in practice use mark and sweep for the tenured objects, so they can fit in here as long as any objects to be passed across language boundaries are immediately tenured. A pure copying collector won't work without stationary proxy objects since we don't want to force all runtimes to handle object motion; I think that's a reasonable restriction. Incremental collection algorithms will work OK, they will just have to queue the marking of edges that lead outside the runtime's own heap until the next global GC. In the same way a runtime can collect more frequently than the global GC (as long as it preserves objects that may be referenced by foreign runtimes).

One wrinkle that I haven't fully thought through yet is how runtimes with exact collection handle mark requests into their heaps that don't point to actual objects (as could be sent by a runtime using conservative collection). I would probably require that the runtime can identify whether an object is being referenced. I would probably also require that interior pointers can be resolved to mark the enclosing object. These requirements would be limited to objects that have been exposed to foreign runtimes.

This approach does not address other issues like how cross-runtime calls can actually happen. That problem is actually much easier; it can be solved non-invasively using wrappers, for example. I think it's good to factor these issues out.

It would be an interesting experiment to take a few popular language runtimes and modify them to participate in a scheme like this. I hope it can be done with a very small amount of code. In practice the hardest problem is probably convincing people that this is worth adding, supporting and shipping in the core of their runtimes, and agreeing on the nitty-gritty details of how the interface would work.

I'd really like to see something like this happen, because the competing approach of shoehorning everything into a single runtime is technically unsatisfying, will slow language and runtime innovation, and will be politically infeasible for the free software world. I first wrote about this idea in 2001 and I'm surprised that I haven't seen it reinvented either in research or the general software world.

Thursday, 6 September 2007


A NZ Herald columnist, Tapu Misa, just "came out" as a Christian convert in her column for today. Amazing! When I used to read the Herald more, she vocally expressed disdain for "religion" (unsurprising, that seems to be shared by most New Zealanders). You never know what God is going to pull off.

The Herald's opinion page will be a bit more interesting now; if I recall correctly she's somewhat left wing, and will therefore be an interesting counterpoint to Garth George, who seems almost the stereotypical crusty right-wing conservative Christian.

Tuesday, 4 September 2007


CoScripter is a fantastic tool. I really hope we can find a way to bring it to the mass market, both because it's cool and useful, and because it leverages the "open box" nature of the Web, the same openness that allows Greasemonkey to exist. That openness is good and powerful, and a strong technical advantage that the Web has over its competitors.

I want to mention that this work originated with Rob Miller and his students
Greg Little and Michael Bolin at MIT with their development of Chickenfoot. Greg developed Koala/CoScripter while an intern at IBM Almaden.

It's especially interesting because I've known Rob for a long time --- I remember talking to him when he was a prospective graduate student visiting CMU, back in 1995. I've also known Tessa, an IBM author on the Koala paper, for several years. Small world eh!

Monday, 3 September 2007


Today's Herald has some marvellous gloom-oriented reporting. The key body text:

Billions of unexpected dollars are being pumped into the New Zealand economy from record payouts to Fonterra shareholders from a combination of spectacular increases in global dairy prices and growth in demand - further boosted by the removal of European Union farmer subsidies.

The headline: Fonterra's big payout a challenge for economy.

It's tough to put negative spin on an unexpected export income windfall, but the New Zealand pundits rise to the challenge. Reliance on the primary sector apparently makes our economy "cyclical and fragile". Anyone notice that the much-coveted Silicon Valley high-tech industry is also "cyclical and fragile" in a big way? The extra incoming cash will "keep pressure on the kiwi [dollar]" --- i.e. we all become too rich relative to the rest of the world. And it will "fuel labour shortages" --- i.e. unemployment will be too low. Horrors!

It seems obvious to me that high-end agriculture is a great business to be in for the long term. The number of people who can afford good quality food is increasing rapidly, and the land available to produce it is decreasing. New Zealand is in a great position to benefit from this. I'm all for growing high-tech industry here --- I'm working hard on that myself --- but there is no need to be embarrassed by our agricultural base. Sometimes I wonder if economists quoted in the press are the equivalent of the moronic "analysts" that plague coverage of the computer industry, regurgitating outdated wisdom until the new reality is blindingly obvious.

Bruce Willis And Quantum Computing

I saw Die Hard 4 on Saturday night. The premise is that a gang of bad guys lay waste to US infrastructure by breaking into computer systems, until Bruce Willis beats them up. It was fun. The details of the computer action were predictably laughable --- why can't Hollywood get this right more often than once a decade?

However, the underlying premise isn't all that unrealistic. It seems plausible a really determined and well-funded group could pull off an attack on that scale. In fact they could probably pull off something much more alarming. The US military and other organizations are deploying large numbers of UAVs (robotic planes) and now also armed robotic ground units. Of course they work hard to protect the command and control structure but given the state of computer systems today there must be vulnerabilities, especially if someone could insert a human spy. Internet crooks taking control of a battalion of robotic soldiers doesn't really bear thinking about. That'd be a great way for a hostile power to negate the US military advantage... and an interesting movie premise!

There's another great hard-science movie premise in this genre: someone should remake Sneakers around the premise that someone has secretly made a breakthrough in quantum computing and implemented Shor's algorithm, giving them the ability to break RSA (i.e. almost all deployed public-key cryptography). The inventor could be either rescued or assassinated by Bruce Willis, depending on how the screenwriters want to play it. Aside from the movie potential, there is also the intriguing likelihood that this will eventually really happen! (Minus the Bruce Willis part.)