Now that accelerated layers have been turned on for a while in Firefox 4 betas, I want to explain what this means for HTML5 <video> playback in Firefox and how it compares with what Adobe is doing in Flash 10.2.
I've already written about our layers framework and how that contributes to GPU-accelerated rendering. For each frame, our video decoder libraries (libvp8, libtheora) produce a YCbCr (aka YUV) image in their output buffer in system memory. This is usually, but not always, in 4:2:0 format. In the video decode thread we wrap this buffer up into a YCbCrImage object, which makes a copy. (We have to copy here because the decoder libraries overwrite their output buffer for every frame, and we're decoding a bit ahead of the current frame to avoid jitter.) Exactly how that copy is made depends on the layers backend:
- For GPU backends that support multithreaded access to GPU resources (currently our D3D10 backend), the video decoder thread copies the output buffer directly into VRAM. This is the most efficient approach.
- For GPU backends that don't support multithreaded access, we copy the output buffer in system memory. The buffer is uploaded to VRAM later when we paint on the main thread.
- For CPU-only rendering, we convert the output buffer from YUV to RGB (while also scaling it to the expected output size) on the video decoder thread.
Actual rendering happens on the main thread. When we composite an ImageLayer containing a YCbCrImage into the window, the GPU converts to RGB at draw time using a shader program (and also handles scaling). When we're not using the GPU, we just draw the pre-converted image (possibly with some rescaling if the output size has changed since we did the convert-and-scale step).
This architecture seems optimal given the constraints we're currently working with, although it will need to evolve in a few directions. Other than being fast, it works well for Web authors. Unlike Adobe's "stage video", it just works without authors having to do anything. You can overlay and underlay content around the video without losing acceleration. You can wrap it in a container with CSS opacity and 2D transforms, scroll it and clip it, without losing acceleration. (Currently a few effects will cause us to deaccelerate --- if the video has an ancestor element with 'border-radius' and non-visible 'overflow', an ancestor with CSS mask, filter or clip-path, or an SVG ancestor. We'll fix those cases post-FF4 by accelerating them.)
This design will need to evolve in a few directions. We currently don't have hardware decoders for our video formats, but that will change as hardware support for VP8 spreads, and we'll need to take advantage of it. We may need to introduce a new Image type or even a new Layer type to encapsulate hardware decoding. I hope that there will be an efficient way to get decoded video data into textures, otherwise we'll have to limit the use of hardware decoders to whatever special cases their APIs support.
Some devices have GPUs with bad enough shader performance that doing YUV conversion with shaders is a noticeable performance hit. But some of them have native support for YUV-format textures or surfaces with dedicated hardware for YUV conversion. We should use those features. This won't require any architectural changes.
In Fennec, we composite the layer tree in a different process to the content process that actually does the video decoding. This will later be true for desktop Firefox too. This requires us to ship our YCbCr frames through shared memory or shared VRAM buffers, but we should be able to keep the same basic architecture.