Tuesday, 4 May 2010

CGLayer Performance Trap With isFlipped

For the last few days I've been working on maximising performance of my new code for scrolling with "retained layers". Basically it's all about fast blitting: I want to quickly "scroll" the pixels within a retained layer surface, and then quickly copy that surface to the screen/window. This will have many advantages in the long run, but initially I have to work hard to make it as fast as the current scrolling approach for simple cases. In those cases the current approach is to call a platform API to move pixels on the screen and then repaint the strip that scrolled into view, and that's hard to beat.

Copying pixels around quickly within a surface is a bit of a problem in cairo. cairo currently doesn't specify what should happen if a surface is both the source and destination of a drawing operation, and at least pixman-based backends do weird things in many of these cases. I think unspecified behaviour is bad, and cairo should just define this case so that the "obvious thing" happens --- it's as if you made a copy of the surface and used that copy as the source while drawing into the surface. It's not hard to implement that way for the general case, and for specific cases we can optimize self-copies quite easily to avoid the temporary surface (e.g., when the self-copy is a simple integer translation of the surface contents). So we can fix the self-copy problem in cairo.
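To make the "simple integer translation" case concrete, here is a minimal sketch in C of an in-place self-copy on a raw 32-bit pixel buffer. It is not cairo code: `scroll_pixels` and its parameters are hypothetical names, and the stride is assumed to be measured in pixels. The point is only that no temporary surface is needed if rows are visited in an order that never overwrites a row before it has been read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a cairo API): translate the contents of a
 * width x height buffer of 32-bit pixels by (dx, dy) in place.
 * stride is the distance between rows, in pixels (assumption).
 * Pixels with no source inside the buffer are left untouched; the
 * caller is expected to repaint that newly exposed strip. */
static void scroll_pixels(uint32_t *pixels, int width, int height,
                          int stride, int dx, int dy)
{
    assert(stride >= width);
    if (dx <= -width || dx >= width || dy <= -height || dy >= height)
        return; /* none of the old contents would remain visible */

    int copy_w = width - (dx < 0 ? -dx : dx);
    int src_x = dx < 0 ? -dx : 0;
    int dst_x = dx > 0 ? dx : 0;

    /* Destination row y receives source row y - dy. When moving
     * contents down (dy > 0) walk bottom to top, otherwise top to
     * bottom, so a source row is always read before it is written. */
    int y_begin = dy > 0 ? height - 1 : 0;
    int y_end   = dy > 0 ? dy - 1 : height + dy; /* one past the last row */
    int y_step  = dy > 0 ? -1 : 1;

    for (int y = y_begin; y != y_end; y += y_step) {
        uint32_t *dst = pixels + (size_t)y * stride + dst_x;
        const uint32_t *src = pixels + (size_t)(y - dy) * stride + src_x;
        /* memmove handles the overlap within a row when dy == 0 */
        memmove(dst, src, (size_t)copy_w * sizeof(uint32_t));
    }
}
```

A general self-copy with scaling or rotation would still need the temporary copy described above; this fast path only covers pure integer offsets.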

My bigger problem has been wrestling with the OS X Core Graphics APIs. Currently cairo-quartz uses CGBitmapContexts for its surfaces, because they give us easy direct access to pixel data and they're easy to use as CGImages if we draw one surface to another with tiling. However, Apple docs enthusiastically recommend using CGLayers for improved performance. Indeed the QuartzCache example shows a significant performance boost from using CGLayers instead of CGBitmapContexts. So I've got a patch that adds to cairo-quartz the ability to create surfaces backed by CGLayers.

Unfortunately, even these CGLayer surfaces don't make scrolling as fast as I want it to be. Shark profiling shows that we spend 20% of the time in the self-copy, moving pixels in the CGLayer, and then 60% of the time actually copying that CGLayer to the window. That sounded a bit wrong since the whole point of CGLayers is that you can efficiently blit them to the window. So I looked closer at the profile and noticed the CGLayer copy time is all in argb32_image_mark_rgb32, while if I profile the QuartzCache CGLayer example (modified to more closely emulate what we do when scrolling), copying the CGLayer to the window uses sseCGSBlendXXXX8888 (via CGSBlendRGBA8888toRGBA8888). Googling, plus inspection of the machine code of those functions, shows that argb32_image_mark_rgb32 is a fairly nasty slow fallback path, and CGSBlendRGBA8888toRGBA8888 is the really fast thing that we want to be using. So the question remains, why are we getting the slow path in my layers code while the QuartzCache example gets the fast path?

This was really painful to answer without CoreGraphics source code. I did some reverse engineering of argb32_image, but it's a huge function (20K of compiled code) and that wasn't fruitful. Instead I wrote experimental code, and eventually just wrote some code that creates a layer for the CGContext we obtain from [[NSGraphicsContext currentContext] graphicsPort] in our NSView's drawRect method, and immediately blits that layer to the context. Still slow.

Clearly there's something wrong with the state of the CGContext of our NSView. But how does our NSView set up its context differently from the QuartzCache example? Then I recalled that we return YES for isFlipped in our NSView to put it into the coordinate system other platforms expect --- (0,0) at the top left. So I tried returning YES for isFlipped in the QuartzCache example --- bingo, it slows right down and takes the argb32_image_mark_rgb32 path. In fact it looks like returning YES for isFlipped slows down a lot of the APIs used in QuartzCache...

Conclusion: for high performance graphics on OS X, avoid isFlipped. Or something like that. It's fairly bogus that adding such a simple transform to the CGContext would hurt performance so much, but so it goes...

I'm not quite sure how we're going to fix this in Gecko yet. I'll be brainstorming on #gfx on IRC tomorrow!


  1. There is an age-old trick, which I'm still using today to implement fast vertical scrolling:
    Don't move the data!
    Just move the "virtual top edge" around. It requires that you paint the buffer to the screen in 2 stages, and it involves some book-keeping logic. But it's pretty fast, because there is no data copying, and it's just 1 extra bitblt operation.
    Feel free to contact me if this explanation isn't clear.

  2. Robert O'Callahan, 4 May 2010 at 21:12

    It's completely clear. I've thought of it myself. It's a little more complicated than that for us because we can scroll horizontally so we'd need a "virtual top-left point" and sometimes use 4 blits to the screen. But where it gets a bit more complicated --- and slower --- is when we need to repaint an area that spans the actual edges of the buffer...
    I'm trying to avoid methods that require tradeoffs like that.

  3. Why not use Cairo's OpenGL backend and skip Core Graphics altogether? Even the worst graphics hardware won't break a sweat doing stuff like this.

  4. Robert O'Callahan, 5 May 2010 at 12:33

    cairo-gl is still quite slow for many things.
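
For reference, the "virtual top edge" trick from the first comment can be sketched roughly as follows, for the vertical-only case. This is an illustration under assumptions, not anyone's actual implementation: `ring_surface`, `ring_scroll`, `ring_row` and `ring_present` are hypothetical names, rows are assumed to be tightly packed, and the "screen" is just another pixel array.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical retained buffer whose rows are treated as a ring. */
typedef struct {
    uint32_t *pixels; /* height rows of width pixels, tightly packed */
    int width;
    int height;
    int top;          /* buffer row currently shown at screen row 0 */
} ring_surface;

/* Scroll the view by dy rows (negative scrolls the other way).
 * No pixel data moves; only the virtual top edge changes. The rows
 * that wrap around hold stale content and must be repainted. */
static void ring_scroll(ring_surface *s, int dy)
{
    s->top = ((s->top + dy) % s->height + s->height) % s->height;
}

/* Map a screen row to the buffer row that stores it (the book-keeping
 * needed when repainting part of the buffer). */
static int ring_row(const ring_surface *s, int screen_y)
{
    return (s->top + screen_y) % s->height;
}

/* Present the buffer in two blits: rows [top, height) first, then
 * rows [0, top) underneath them. */
static void ring_present(const ring_surface *s, uint32_t *screen)
{
    size_t row_bytes = (size_t)s->width * sizeof(uint32_t);
    int first = s->height - s->top;

    memcpy(screen, s->pixels + (size_t)s->top * s->width,
           (size_t)first * row_bytes);
    memcpy(screen + (size_t)first * s->width, s->pixels,
           (size_t)s->top * row_bytes);
}
```

The cost mentioned in the reply shows up in `ring_row`: a repaint rectangle that straddles the wrap point has to be split into separate draws, and with horizontal scrolling as well that becomes up to four pieces.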