Copying pixels around quickly within a surface is a bit of a problem in cairo. Currently cairo doesn't specify what should happen if a surface is both the source and the destination of a drawing operation, and at least the pixman-based backends do weird things in many of these cases. I think unspecified behaviour is bad, and cairo should just define this case so that the "obvious thing" happens: the result is as if you made a copy of the surface and used that copy as the source while drawing into the surface. It's not hard to implement that way for the general case, and for specific cases (e.g., when the self-copy is a simple integer translation of the surface contents) we can quite easily optimize self-copies to avoid the temporary surface. So we can fix the self-copy problem in cairo.
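To make the integer-translation case concrete, here's a rough sketch of an in-place shift on a raw 32-bit pixel buffer. This is not cairo code: `self_copy_translate` and its parameters are names I made up for illustration, and it assumes a simple packed buffer with the stride measured in pixels. The only trick is picking the row order so that no source row gets overwritten before it's read, which gives the same result as copying through a temporary surface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a cairo API): shift the contents of a 32-bit
 * pixel buffer by (dx, dy) in place. The result matches what you would get
 * by copying the buffer to a temporary and drawing the copy at (dx, dy).
 * Destination pixels with no corresponding source pixel are left unchanged. */
static void self_copy_translate(uint32_t *pixels, int width, int height,
                                int stride_px, int dx, int dy)
{
    assert(pixels != NULL && width > 0 && height > 0 && stride_px >= width);

    if (abs(dx) >= width || abs(dy) >= height)
        return; /* shifted copy lands entirely outside the surface */

    int copy_w = width - abs(dx);
    int copy_h = height - abs(dy);
    int src_x = dx > 0 ? 0 : -dx;
    int dst_x = dx > 0 ? dx : 0;
    int src_y = dy > 0 ? 0 : -dy;
    int dst_y = dy > 0 ? dy : 0;

    /* When moving content down (dy > 0), walk rows bottom-to-top so a row
     * is always read before anything is written over it. Otherwise walk
     * top-to-bottom. */
    int step = dy > 0 ? -1 : 1;
    int row = dy > 0 ? copy_h - 1 : 0;

    for (int i = 0; i < copy_h; i++, row += step) {
        uint32_t *src = pixels + (size_t)(src_y + row) * stride_px + src_x;
        uint32_t *dst = pixels + (size_t)(dst_y + row) * stride_px + dst_x;
        /* memmove handles the horizontal overlap within a row (dy == 0). */
        memmove(dst, src, (size_t)copy_w * sizeof *src);
    }
}
```

Scrolling a 4x3 buffer one pixel right and one pixel down would be `self_copy_translate(p, 4, 3, 4, 1, 1)`. A real backend would also have to deal with clipping, pixel formats and non-integer transforms, all of which this sketch ignores.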
My bigger problem has been wrestling with the OS X Core Graphics APIs. Currently cairo-quartz uses CGBitmapContexts for its surfaces, because they give us easy direct access to pixel data and they're easy to use as CGImages if we draw one surface to another with tiling. However, Apple docs enthusiastically recommend using CGLayers for improved performance. Indeed the QuartzCache example shows a significant performance boost from using CGLayers instead of CGBitmapContexts. So I've got a patch that adds to cairo-quartz the ability to create surfaces backed by CGLayers.
Unfortunately, even these CGLayer surfaces don't make scrolling as fast as I want it to be. Shark profiling shows that we spend 20% of the time in the self-copy, moving pixels in the CGLayer, and then 60% of the time actually copying that CGLayer to the window. That sounded a bit wrong since the whole point of CGLayers is that you can efficiently blit them to the window. So I looked closer at the profile and noticed the CGLayer copy time is all in argb32_image_mark_rgb32, while if I profile the QuartzCache CGLayer example (modified to more closely emulate what we do when scrolling), copying the CGLayer to the window uses sseCGSBlendXXXX8888 (via CGSBlendRGBA8888toRGBA8888). Googling, plus inspection of the machine code of those functions, shows that argb32_image_mark_rgb32 is a fairly nasty slow fallback path, and CGSBlendRGBA8888toRGBA8888 is the really fast thing that we want to be using. So the question remains, why are we getting the slow path in my layers code while the QuartzCache example gets the fast path?
This was really painful to answer without the CoreGraphics source code. I did some reverse engineering of argb32_image, but it's a huge function (20K of compiled code) and that wasn't fruitful. Instead I wrote a series of experiments, and eventually ended up with some code that creates a layer for the CGContext we obtain from [[NSGraphicsContext currentContext] graphicsPort] in our NSView's drawRect method, and immediately blits that layer to the context. Still slow.
Clearly there's something wrong with the state of the CGContext of our NSView. But how does our NSView set up its context differently from the QuartzCache example? Then I recalled that we return YES for isFlipped in our NSView to put it into the coordinate system other platforms expect --- (0,0) at the top left. So I tried returning YES for isFlipped in the QuartzCache example --- bingo, it slows right down and takes the argb32_image_mark_rgb32 path. In fact it looks like returning YES for isFlipped slows down a lot of the APIs used in QuartzCache...
Conclusion: for high performance graphics on OS X, avoid isFlipped. Or something like that. It's fairly bogus that adding such a simple transform to the CGContext would hurt performance so much, but so it goes...
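For reference, here's how simple that transform is. This sketch assumes the flip amounts to the usual "negate y and translate by the view height" matrix; I haven't inspected what AppKit actually installs in the context, so treat that as an assumption. `affine_t`, `flip_for_height` and `affine_apply` are made-up names, with fields ordered the same way as in CGAffineTransform.

```c
/* Illustrative only: a 2D affine transform with the same field convention
 * as CGAffineTransform:
 *   x' = a*x + c*y + tx
 *   y' = b*x + d*y + ty */
typedef struct {
    double a, b, c, d, tx, ty;
} affine_t;

/* Assumed form of the flip for a view of the given height:
 * y is negated and then shifted so that (0,0) maps to the top-left corner. */
static affine_t flip_for_height(double height)
{
    affine_t m = { 1.0, 0.0, 0.0, -1.0, 0.0, height };
    return m;
}

/* Apply the transform to a point. */
static void affine_apply(affine_t m, double x, double y,
                         double *out_x, double *out_y)
{
    *out_x = m.a * x + m.c * y + m.tx;
    *out_y = m.b * x + m.d * y + m.ty;
}
```

With a height of 100, the point (10, 0) maps to (10, 100) and (10, 100) maps back to (10, 0). There's no rotation or non-integer scaling in there, just a sign change on one axis.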
I'm not quite sure how we're going to fix this in Gecko yet. I'll be brainstorming on #gfx on IRC tomorrow!