Wednesday, 26 September 2007

Parallel DOM Access

Suppose we want to exploit parallelism to load and run an existing Web application faster. How might that be done?

Here's a very high level diagram of the data flow inside the browser. I believe it's reasonably accurate for Webkit as well as Gecko.


There are some missing edges in some cases; for example, the style system can affect the DOM via XBL. I'm going to gloss over those for now.

Observe that the DOM is really the center of the universe. It's a big blob of mutable state that everything else depends on. Any parallelization of single-page applications is going to have to carefully manage access to the DOM.

For performance reasons, changes to the DOM are lazily propagated to the style system, layout and rendering. The goal is to batch changes so that these phases are not reexecuted too often --- they're expensive, even when incremental. DOM updates make method calls to registered listeners. Listeners record which styles might have been affected and post a restyle event (if one hasn't already been queued) to be processed later. Similarly, they set dirty bits in the frame tree and then post a reflow event. On the other side, the parser batches its appends to the DOM as well. It's primarily only script execution that makes many fine-grained, interleaved reads and writes.

[There are times when scripts need access to up-to-date style and layout data. In Gecko, at those times we invoke FlushPendingNotifications to force synchronous restyling and layout. Because this is already a significant performance hit, Web developers should already be trying to avoid triggering it (e.g., by accessing DOM offset* properties).]

I think we could leverage these access patterns to help parallelize browser execution:


  • Parsing can be mostly asynchronous; it should only need to synchronize with the DOM when it's ready to append a bunch of elements.
  • Styling, layout and rendering can take a snapshot of the DOM and use it while the real DOM is being updated by other tasks. It's OK for the display of a page to lag slightly behind; it already does today. The snapshot will be refreshed each time we need to update the window again, or when layout results must be return synchronously to scripts.
  • To run scripts, such as JS event handlers, in parallel, we could use optimistic concurrency via transactions. The idea is that we run a script speculatively against a snapshot of the DOM, recording which parts of the DOM it reads and suppressing its changes to the state of the DOM and other parts of the world. When the script has completed, we synchronize with the real DOM and check whether anything the script depended on has changed. If not, then we can commit the script's global changes. Otherwise, its execution is invalid; we throw out its changes and reexecute it all over again.
  • Some DOM-accessing code cannot be executed speculatively, such as plugins, or Javascript that takes actions that cannot be rolled back, such as sending a message to a server. (At CMU, the canonical example of an irrevocable message was "fire missile!") Therefore at any point in time there can be (at most) one thread accessing the "real DOM" which is not speculative and will never be asked to abort.

This model would be particularly useful for Firefox where extensions written in Javascript have access to the DOMs of all browsed pages. We could allow extensions to execute in parallel with client code as long as at least one of them is speculative.

None of this is easy to implement. Making real snapshots of the DOM would be prohibitively expensive, so some kind of versioning or copy-on-write scheme is required. Which state constitutes "the DOM" that needs versioning is actually a little nebulous; for example, would we version the state of <canvas> elements? How about DOM Storage or the proposed DOM SQL API? However, the good news is that some of this can be done incrementally. For example, we could start with a limited slice of DOM state that is versioned, and decree that speculative script accessing any other state is aborted, and then gradually increase the scope of versioning.

There's a lot more to think and talk about in this area.



14 comments:

  1. Roughly what percentage of time do the various blocks in your diagram take up today, in Gecko?
    I'm trying to think what the best-case win for your parallelized scheme is, and I'm wondering whether it's much of a win. It seems like the biggest wins here come when one of the following is true:
    * Parsing takes a long (CPU) time, so we frequently have a significantly updated DOM that we'd like to layout and render, if it didn't block further parsing. Does this really happen? Seems like it wouldn't be likely.
    * Script is doing lots of updates to the DOM for a long period of time. I guess this could be the case if we have a lot of Greasemonkey scripts (or extensions, as you mentioned), but I wonder how applicable it is in general.
    Even in these cases, the win is mainly that your page appears to refresh slightly more quickly. It doesn't seem like in the common case this makes your rendering noticeably faster, or makes script run lots faster (unless you had lots of simultaneous readers of the DOM data... maybe with Gears' multithreaded JS?). Am I totally missing your point?
    I feel like a bigger win for Firefox would come from either finding a way to pull UI event handling/updating out onto its own thread/process, or at least making all the other pieces of the browser enforce maximum running times on themselves before giving the message loop another crack at things. That would take care of one of my biggest annoyances with Firefox, namely the way that large pages which finally finish loading or Flash videos which are doing lots of processing or network access cause the browser's UI to hang. If the app was always responsive it would go a long way to making it feel both faster and more stable.

    ReplyDelete
  2. Snapshot DOM (SDOM) is exactly what I've been thinking lately, though only to be used by style/layout/rendering (slr). To help that, calls to synchronously flush layout should be reduced. Making sure that FlushPendingNotifications (or whatever in other engines) is used only when really needed might help performance even now, and if slr could be moved to a separate thread, then even more.
    One thing to think about is event handling. Events are targeted based on layout objects etc. If layout is using SDOM, and DOM has been changed, what should be done? Possibly resynchronize SDOM based on version information in DOM. Hopefully synchronization doesn't happen too often, like for every mousemove event.
    Could the slr be split somehow to be run on several threads. Rendering/painting using a snapshot of layout... mem usage would climb if there was many kinds snapshots of the system...

    ReplyDelete
  3. Robert O'Callahan26 September 2007 10:12

    Peter:
    > Roughly what percentage of time do the various
    > blocks in your diagram take up today, in Gecko?
    Depends entirely on the page, but they're all significant.
    > * Parsing takes a long (CPU) time, so we
    > frequently have a significantly updated DOM
    > that we'd like to layout and render, if it
    > didn't block further parsing. Does this really
    > happen? Seems like it wouldn't be likely.
    Parsing is often a significant amount of CPU time. It would be very nice to be able to overlap it with inline script execution and with incremental layout and rendering.
    > Script is doing lots of updates to the DOM for
    > a long period of time. I guess this could be
    > the case if we have a lot of Greasemonkey
    > scripts (or extensions, as you mentioned), but
    > I wonder how applicable it is in general.
    Script-heavy apps are getting more and more common. And many are already coded with lots of callbacks, handling AJAX responses, user input, periodic timers, animation, etc. These don't all do "lots of updates" to the DOM, but that's good...
    > Even in these cases, the win is mainly that
    > your page appears to refresh slightly more
    > quickly. It doesn't seem like in the common
    > case this makes your rendering noticeably
    > faster, or makes script run lots faster (unless
    > you had lots of simultaneous readers of the DOM
    > data... maybe with Gears' multithreaded JS?).
    > Am I totally missing your point?
    I'm hypothesizing that we can a) improve page load times and b) improve interactive response times.
    Definitely, one of the first things to be done before stepping down this path would be to test this hypothesis, probably by instrumenting an engine to capture detailed application behaviour and then simulate a parallelization scheme to estimate the benefit.
    I'm also not suggesting this is the only way to exploit parallelism. This is just one way that might be valuable to us.
    > I feel like a bigger win for Firefox would come
    > from either finding a way to pull UI event
    > handling/updating out onto its own
    > thread/process, or at least making all the
    > other pieces of the browser enforce maximum
    > running times on themselves before giving the
    > message loop another crack at things.
    I definitely agree. Your first suggestion is hard for us because of the way XUL, JS and the extension system work (which is also one of our strengths, so I don't think we should just walk away from it). Supporting parallel DOM access would actually be a good step towards supporting running browser UI concurrently with content script, without breaking the current model.
    Your second suggestion is also on the money. Two things that we should do post-1.9 are "interruptible reflow" (detect pending events, stop reflow, and then restart it after handling the events), and putting (some) plugins in their own process(es). But these are separate issues really, and much easier to solve.
    smaug:
    > Events are targeted based on layout objects
    > etc. If layout is using SDOM, and DOM has been
    > changed, what should be done?
    We can actually tell whether an event is targeting a dirty frame or not. I think in common cases we could get away without flushing layout every time the mouse moves.
    > Could the slr be split somehow to be run on
    > several threads.
    Yep but they probably all want to use the same snapshot.
    > mem usage would climb if there was many kinds
    > snapshots of the system...
    Yes. It's imperative that you only store the differences between the snapshots, not entire copies of the DOM.

    ReplyDelete
  4. I would expect that the web2.0 eye candy stuff with animated pull up, down, through, whatever would be the biggest challenge. And stuff like finkle's bubbles demo.
    All of them create content dynamically, partly even lots of it, and then repeatedly modify the DOM to change the style and layout, and look like a worst-case scenario here to me.
    Being able to animate the style natively could give an perf improvement path to webauthors in this case, too.

    ReplyDelete
  5. Robert, I think you're going in the right direction. Here are some more thoughts on DOM snapshots from a while back:
    http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2005-July/004304.html

    ReplyDelete
  6. Do you envision something like RCU as a way to snapshot the DOM? (http://www.rdrop.com/users/paulmck/RCU/whatisRCU.html)
    At one point I looked at using RCU to make access to the XPCOM component registry lock-free in the dominant cases, but it didn't make it to the top of the stack.

    ReplyDelete
  7. Robert O'Callahan26 September 2007 21:01

    I don't think RCU is quite what is needed here.

    ReplyDelete
  8. "Snapshots" and "Copy on write" sound like you're thinking about implementing the DOM as a "Purely Functional Data Structure"?

    ReplyDelete
  9. A few years I tried implementing a thread aware browser and it was a complete disaster. It is very difficult to get the locking right and adding the locks everywhere slows everything down. I concluded that trying to split a page up using fine grained parallelism wasn't a good strategy. We did end up implementing image downloads in threads.
    There are other schemes for increasing parallelism in a browser. For example each page could be in it's own process with a process managing the browser frame and tabs. Like MS OLE. This keeps pages from messing with each other's performance.
    Schemes using shared memory can work too. Things are split into different processes to achieve parallelism and then communicate via asynchronous messaging. Access to the data structures is serialized but things like parsing or image decoding happen in the parallel processes and send messages to update the DOM.
    Lately we have been experimenting in git with fine grained parallelism. So far it is always slower except when we can go parallel for something that takes several seconds to complete.

    ReplyDelete
  10. What about using transparent wrappers for DOM method return values (and out-of-scope JS values) to defer execution of results until necessary?
    The Javascript value returned from DOM methods (and potentially passed to them) represents an opaque token. When the script reaches a point where it needs to get the real value - an if statement, passing the value to a single-threaded object or combining it with a value independent of the shared state, it ends up being a synchronization point where the code would wait for the correct value.
    This could limit the amount of rollback for functions. The opaque tokens could even be passed back into the DOM, where the results would be evaluated when available.
    I think the PyPy developers worked on something like this.

    ReplyDelete
  11. On the TV show "24" last year, the US government fired a missile in a game of brinksmanship with another country but eventually redirected it into an ocean. I'm not sure how realistic it is, but perhaps missile launches aren't as irrevocable as they used to be.

    ReplyDelete
  12. Robert O'Callahan28 September 2007 22:58

    You can abort the flight but you can't get the missile back into its launch tube :-)

    ReplyDelete
  13. The language you are using here makes me think of file systems... If you view the DOM as a file system with elements/nodes as directories and attributes as files (with content being attributes on anonymous nodes), exactly like the DOM inspector does. Documents become 'volumes' and are 'mounted' at various points.
    Maybe the right thing to do is to think like this. Abandon C++ object thinking for a while and build the DOM as an in memory file system, using a special memory allocator. Then layer the objects back onto the 'files' as an abstraction.
    In this case the active set of C++ objects could be a lot smaller than the actual set of DOM nodes, because you could only have active C++ objects for the nodes currently being accessed by methods.
    And if you had some sort of 'mmap' support you could make the construction of C++ nodes be very memory conservative for nodes with lots of data (images, text).
    This would allow you to leverage a lot of work on parallel access to file systems, including locking, snapshots, copy-on-write, transactions, caching, garbage collection, etc.
    I'm not sure that the style system is at the right level in that picture though? The style information is layered on top of the DOM, like a unionfs in some ways... I guess this is what your dashed arrows are trying to say. A unionfs would also allow you to have a layer on top of a parsed HTML page which represented the active version of the page, where all the JavaScript changes would go, and which would be layered on the original DOM to provide the back/forward cache.
    The unionfs idea could be used to do the transactions. Snapshot the DOM, 'mount' a new unionfs layer on the snapshot, then try to fold it back in after completion.
    You could also use this 'memory file system' and the real mmap to cache the DOM on disk rather than raw pages. It would give you a good speedup, since you could avoid parsing the page.
    One would have to do some careful thinking about what information you would want in the file system. Do creation time and ownership really have a real purpose in the DOM? The DOM also requires order preserving between directory entries. You would also want to have very small blocks, and forget about disk platters etc. But on the plus side, there is a lot of free code, and you can sift through it looking for good ideas. You could probably even prototype the system on current memory file system like tmpfs, using it to develop the DOM C++ code, and seeing what your access patterns looked like, and then using that profiling info to build your special file system.
    Maybe at a lower level, the problem here is that the browser is a seen as an application, and not the kernel of a new type of OS.
    OTOH, I probably don't know what I'm talking about.
    -Jeremy

    ReplyDelete
  14. deepwalkercr__(4t)__gmale1 November 2007 20:13

    Did you ever see the parrallel access all atomic functions hashtable on Google tech vids
    Very light on processor time, n level versioning.
    Could be viewed as a file system, but very simple and very very fast

    ReplyDelete