Thursday 1 April 2010
Layers
Bas already wrote a good introduction to our new layers framework for cross-platform, GPU-assisted, retained-mode rendering. I've been meaning to blog about it myself, since it's what I've mostly been working on for the last few months, but to be honest I held back since I wasn't 100% sure whether, how or when it was all going to work. Now that fullscreen video playback is GPU-accelerated on trunk (currently only on Windows with GL), and I've made considerable progress in other areas of the work, I'm a bit more confident :-).
As Bas explained, the goal is to use the GPU for fast compositing and other effects. We also need to enable compositing on a thread other than the "main thread" where most browser operations (such as JavaScript) run, so that animations and video playback can run smoothly even when the main thread is blocked on other work for significant periods of time. (Frameworks such as Core Animation --- which WebKit uses on Mac --- do this, and it rocks.) Off-main-thread compositing means we need some kind of 2D scene graph that's accessible off the main thread. To be GPU-friendly, it behoves us to make the elements of this graph as large and as few in number as possible, e.g., they shouldn't be as fine-grained as individual CSS boxes. Like Core Animation, we call the nodes of this graph --- which is a tree --- layers. (Not to be confused with Netscape 4's layers!)
We've proceeded incrementally. First we defined a fairly simple API with only two kinds of layers: ThebesLayers and ContainerLayers. ThebesLayers are leaves representing one or more display items that will be rendered using Thebes (our wrapper around cairo). A ThebesLayer might be implemented using a buffer in VRAM that those display items are rendered into. ContainerLayers don't draw anything on their own but group together a set of child layers. Any layer can have various effects applied to it --- currently rectangular clipping, fractional opacity, and/or a transform matrix.
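To make the shape of such an API concrete, here's a rough C++ sketch of layer classes along these lines. The names and signatures are my own illustration, not Gecko's actual interfaces:

```cpp
// Hypothetical sketch of a minimal layer API in the spirit of the one
// described above; names and signatures are illustrative, not Gecko's.
#include <cairo.h>
#include <vector>

struct Matrix4x4 { float m[16]; };          // 4x4 transform matrix
struct Rect { int x, y, width, height; };

class Layer {
public:
  virtual ~Layer() {}

  // Effects any layer can have applied: rectangular clip, fractional
  // opacity, and a transform matrix.
  void SetClipRect(const Rect& aClip)         { mClip = aClip; mHasClip = true; }
  void SetOpacity(float aOpacity)             { mOpacity = aOpacity; }
  void SetTransform(const Matrix4x4& aMatrix) { mTransform = aMatrix; }

  bool HasClip() const             { return mHasClip; }
  const Rect& GetClipRect() const  { return mClip; }
  float GetOpacity() const         { return mOpacity; }

protected:
  Rect mClip = {0, 0, 0, 0};
  bool mHasClip = false;
  float mOpacity = 1.0f;
  Matrix4x4 mTransform = {};
};

// Leaf layer: display items drawn by Thebes (cairo), perhaps into a
// buffer held in VRAM.
class ThebesLayer : public Layer {
public:
  virtual void Paint(cairo_t* aTarget) { (void)aTarget; }
};

// Groups child layers; draws nothing itself.
class ContainerLayer : public Layer {
public:
  void AppendChild(Layer* aChild) { mChildren.push_back(aChild); }
  const std::vector<Layer*>& GetChildren() const { return mChildren; }
private:
  std::vector<Layer*> mChildren;
};
```

The point of keeping the node types this coarse is that the whole scene is described by a small tree plus a handful of effect attributes, which is exactly what a compositor running on another thread (or on the GPU) wants to consume.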
Our layers API was carefully designed so that although it admits a retained-mode implementation with off-main-thread compositing, you can actually implement it in immediate mode with main-thread compositing. For the first layers landing, we created a BasicLayers implementation that does just that. Every time we paint, we construct a new layer tree, and BasicLayers traverses it to render every layer into the destination context using cairo. This isn't any faster than our old code; we end up making almost exactly the same cairo calls the old code did, but it was a good way to get started. BasicLayers will remain useful in the long term because it can render into any cairo context; for example, we can print layer trees. I want to avoid having "layers" vs "non-layers" rendering paths in our code.
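For a sense of what such an immediate-mode traversal involves, here's a sketch using the hypothetical classes above. The cairo functions (cairo_save, cairo_clip, cairo_push_group, cairo_paint_with_alpha) are real cairo API, but the traversal itself is invented for illustration and is not BasicLayers' actual code:

```cpp
// Immediate-mode compositing sketch: recursively render a layer tree
// into a single cairo context.
#include <cairo.h>

void Composite(Layer* aLayer, cairo_t* aTarget) {
  cairo_save(aTarget);

  // Apply the layer's rectangular clip, if any.
  if (aLayer->HasClip()) {
    const Rect& c = aLayer->GetClipRect();
    cairo_rectangle(aTarget, c.x, c.y, c.width, c.height);
    cairo_clip(aTarget);
  }
  // (Applying the transform via cairo_transform() is elided for brevity.)

  // Fractional opacity needs a temporary group that is composited back
  // with the layer's alpha.
  bool needsGroup = aLayer->GetOpacity() < 1.0f;
  if (needsGroup) {
    cairo_push_group(aTarget);
  }

  if (ContainerLayer* container = dynamic_cast<ContainerLayer*>(aLayer)) {
    // Containers draw nothing themselves; just recurse into children.
    for (Layer* child : container->GetChildren()) {
      Composite(child, aTarget);
    }
  } else if (ThebesLayer* thebes = dynamic_cast<ThebesLayer*>(aLayer)) {
    // Leaves paint their display items directly with cairo (Thebes).
    thebes->Paint(aTarget);
  }

  if (needsGroup) {
    cairo_pop_group_to_source(aTarget);
    cairo_paint_with_alpha(aTarget, aLayer->GetOpacity());
  }

  cairo_restore(aTarget);
}
```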
We construct a layer tree by walking a display list. One interesting property of this approach is that we only create layers for visible content. This is very important, since we need to conserve VRAM and keep layer overhead to a minimum.

The next step towards accelerated video rendering was to rework video to use layers for rendering instead of drawing pixmaps with cairo. We created a new layer type, ImageLayer, to render pixel buffers in various formats, in particular the planar YCbCr data emitted by our Theora decoder (and many other decoders). Our video decoder runs on its own thread and we want to eventually be able to play video frames without rendezvousing with the main thread, so we need to update the current frame of an ImageLayer on non-main threads. However, for simplicity we only allow the layer tree to be updated by the main thread. We solved this by introducing a thread-safe ImageContainer object. The main thread creates an ImageContainer and an ImageLayer which references the ImageContainer. When it's time to display a new video frame, the decoder thread creates a YCbCrImage and makes it the current image of the ImageContainer. Whenever an ImageLayer is composited into the scene, we use the ImageContainer's current Image. Images are immutable, so they're easy to use safely across threads. Internally, the BasicLayers implementation of YCbCrImage continues to perform the same CPU-based colorspace conversion that we were doing before; that conversion happens on the decoder thread when the Image is set in the ImageContainer. One other change here was that the transform which letterboxes video frames into the <video> element was moved from being a cairo operation to being a transform on the video layer (which BasicLayers implements using cairo!). Once again, there was no real behaviour change, just a refactoring of our internal structures.
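Here's a minimal sketch of how that hand-off might look, assuming a mutex-guarded pointer swap and reference-counted immutable images; the class names follow the post, but the implementation details are my guesses, not the actual code:

```cpp
// Sketch of a thread-safe ImageContainer: the decoder thread swaps in
// new immutable Images; the compositing code reads the current one.
#include <memory>
#include <mutex>

class Image {
public:
  virtual ~Image() {}
  // Immutable after construction, so it can be shared across threads.
};

class YCbCrImage : public Image {
  // Planar YCbCr frame data (or the RGB result of CPU conversion,
  // performed on the decoder thread) would live here.
};

class ImageContainer {
public:
  // Called on the decoder thread when a new frame is ready.
  void SetCurrentImage(std::shared_ptr<Image> aImage) {
    std::lock_guard<std::mutex> lock(mMutex);
    mCurrentImage = std::move(aImage);
  }

  // Called when an ImageLayer referencing this container is composited.
  std::shared_ptr<Image> GetCurrentImage() {
    std::lock_guard<std::mutex> lock(mMutex);
    return mCurrentImage;
  }

private:
  std::mutex mMutex;
  std::shared_ptr<Image> mCurrentImage;
};
```

Because each Image is immutable, the lock only has to guard the pointer swap; neither thread ever mutates pixel data the other might be reading.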
In parallel with all this, Bas Schouten was working on a real GPU-based implementation of the layers APIs. First he tackled D3D10, then GL. The contents of ThebesLayers are rendered by the CPU (or maybe not, if we're using D2D on Windows) and uploaded to D3D or GL buffers in VRAM. They can then be composited together by the GPU. YCbCrImages are uploaded as texture data, and during composition they're converted to RGB by the GPU, for a performance win. The letterboxing transform is also applied by the GPU, for a big performance win.

That GL code has landed on trunk. Right now there are some serious limitations, which I've glossed over and will talk about in a future post, so we can't just enable GL for all browser rendering. However, we've created a way for toplevel chrome windows to opt in to accelerated rendering, and we've applied this to the window created for full-screen video. Right now this only works on Windows, but soon someone will get it working on Mac and X. It's ironic, since our long-term GPU-acceleration solution for Windows is D3D, not GL (due to driver issues), but GL happens to be Bas' favourite platform :-).

This post is already too long, but before I go I want to mention a few approaches we didn't take. One approach to using the GPU would be to just use a GPU-based backend for cairo, like cairo-D2D and cairo-gl. We like those backends and we do plan to use them, but cairo is fundamentally an immediate-mode API, and I explained above why some kind of 2D scene graph API is essential.

Another obvious approach would have been to adopt an existing retained-mode, GPU-accelerated scene library such as Clutter. I actually went to Emanuele Bassi's Clutter talk at LCA in January and talked to him afterwards. Clutter is not suitable for our use for two main reasons: its API is totally GL-based, and reworking it for D3D or something else would really suck; and it does its compositing on the main thread (the same thread as the GLib event loop), which would be very hard to change given the existing Clutter API.

Another approach would have been to follow WebKit and abstract over platform scene frameworks --- e.g. using Core Animation on Mac and maybe Clutter on GTK. The problem is that those frameworks don't exist where we need them; the only comparable thing on Windows that I know of is WPF's milcore, which is unusable by third-party apps (and isn't on WinXP). I just mentioned how Clutter's threading model isn't what we want. Even Core Animation on Mac isn't as capable as we want; for example, it can't currently support SVG filters directly. If you own the platform you can perhaps assume these frameworks will evolve to fit your needs; we don't.

There's a lot more to talk about --- cross-process layers, what's currently being worked on, how we're going to overcome current limitations like the fact that we currently rebuild the layer tree on every paint. I'll try to blog again soon.
Comments
Compared playback on about a 900 MHz Dell Intel box running Ubuntu Karmic, using the two "high"-quality videos here:
http://people.xiph.org/~greg/video/ytcompare/comparison.html
Firefox 3.5 performs very poorly on the Ogg video, but the Flash video is fine (I think a plugin is used, but I don't recall which). In contrast, the Totem player is fine with either.
I can supply details but don't want to pollute Bugzilla if these two problems are (1) well known and (2) being worked on. Thanks.
One fundamental issue is that integrating video playback and rendering into a browser is necessarily more complicated, harder to get right, and higher-overhead than using a standalone video player. Having said that, I believe that once we enable GL acceleration for Linux we should be able to be just as good as any standalone player --- modulo bugs.
On Linux in particular there is one other issue, which is the plethora of sound drivers and audio backends. On any given system, one sound framework might work a lot better than another, and there's no way of knowing what the right one to use is :-(.
in theory, we could off-load some of the user-related code, like event processing, to separate threads; it's something that has been proposed multiple times, and if you don't touch the scene graph (or don't touch it that often) you can still avoid the death-by-lock-contention scenario.
unfortunately, we cannot really off-load the rendering sequence to another thread: multi-threading in GL is a huge minefield, and too many scenarios will result in the pipeline blowing up in your face. :-(
looking forward to having a look at the code! :-)
Our concurrency story is definitely not going to be based on a big lock, though. We'll have layer trees on the "main thread" of each browser process; updates to those layer trees are logically copied over to "shadow" layer trees that are used for compositing. (Of course we'll try to avoid making actual copies of graphics data.)
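As a rough illustration of that idea (every name here is hypothetical, and the real design may batch and transport updates quite differently), the main thread could record its layer-tree updates into a transaction that the compositing thread applies to its shadow tree before compositing:

```cpp
// Hypothetical sketch of the "shadow layer tree" idea: the main thread
// records updates into a transaction; the compositor thread applies
// them to its own copy of the tree. All names are invented.
#include <functional>
#include <mutex>
#include <vector>

class ShadowLayerTree { /* compositor-side copy of the layer tree */ };

class ShadowLayerTransaction {
public:
  // Main thread: record an update instead of touching the shadow tree.
  void RecordUpdate(std::function<void(ShadowLayerTree&)> aUpdate) {
    std::lock_guard<std::mutex> lock(mMutex);
    mUpdates.push_back(std::move(aUpdate));
  }

  // Compositor thread: drain and apply pending updates, holding the
  // lock only long enough to grab the batch.
  void ApplyTo(ShadowLayerTree& aTree) {
    std::vector<std::function<void(ShadowLayerTree&)>> updates;
    {
      std::lock_guard<std::mutex> lock(mMutex);
      updates.swap(mUpdates);
    }
    for (auto& update : updates) {
      update(aTree);
    }
  }

private:
  std::mutex mMutex;
  std::vector<std::function<void(ShadowLayerTree&)>> mUpdates;
};
```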