Monday, 31 July 2017

Selecting A Compression Algorithm For rr

rr's traces are large. Memory-mapped files account for a lot of that, but the most efficient way to handle them is "zero copy" file cloning so we don't want to compress them during recording. Most of the rest is recordings of data copied into tracee address spaces by the kernel, and plus snapshots of registers, and these data are extremely compressible — often containing long runs of zeroes, for example. For a long time rr has used zlib to compress the non-mapped-file trace data, and zlib's 'deflate' algorithm often achieves compression of 8x or more.

Of course zlib is pretty old and significantly better algorithms exist, so now seems like a good time to reevaluate that decision. I used the Squash framework to compare zlib to two contenders, brotli and zstd, on actual rr trace data: a single, fairly short Firefox run, 828MB uncompressed, treated as independent 1MB chunks because that's what rr does. Here are the results:

I've omitted compression levels that took more than 20 seconds to compress the data. Currently rr uses zlib level 6, which takes just over 12 seconds to compress the data. Data compression occurs in parallel with the rest of recording, and uses multiple cores when it needs to, so in practice is seldom a performance bottleneck.

On this data, both brotli and zstd beat zlib by a significant margin, so we're definitely leaving some performance on the table by sticking with zlib. In particular, given the same time budget, zstd can shave 14% off the size of the non-mapped-file trace data, and brotli can shave off 17%. Alternatively, for the same trace size we could use much less CPU time — zstd level 1 compresses slightly better than zlib level 6, at 10x the speed!

For rr I think brotli level 5 is an obvious choice. For some reason there's a sudden improvement in compression at level 5, where it passes zstd and reaches roughly its optimal compression given a reasonable time budget. At level 5 we're shaving 17% off the current size and also taking 32% off the CPU time.

Apart from the performance, brotli also has a better licensing story. zstd has Facebook's standard patent license, which terminates if you sue Facebook for any patent infringement, and some organisations aren't comfortable with that. Apparently people have done patent searches and haven't found any Facebook patents covering zstd, but that is not wholly reassuring (while also being mystifying — if they're not applying for relevant patents, why not make that clear?). On the other hand, Google has made a clear commitment to license brotli royalty-free with no such conditions. Of course there could be third-party patents, but if they became a problem it would be easy to switch rr's algorithm (especially compared to the trouble they would cause for Web sites and browsers!).

Of course there are lots of other compression algorithms I could evaluate, but I guess if there are any further gains to be had, they would be very minor.

Update Unfortunately Ubuntu doesn't have a brotli library package. (Fedora does.) So, using brotli would mean everyone building rr on Ubuntu has to build brotli themselves first, or we vendor brotli into rr (or we do something truly insane like have rr pull and build brotli at build time if necessary). None of these approaches are appealing :-(. I guess there's also "rewrite rr in Rust so we can use cargo to have reasonable dependency management", which is appealing but not currently practical.

I'm leaning towards vendoring brotli into rr.

8 comments:

  1. Hi Robert. FYI, the issue about brotli and woff2 being bundled and duplicated into various web engines was discussed during the 2016 edition of the Web Engines Hackfest. Since then, I have started to submit some patches to be able to use shared libraries and public headers so that these libraries can be packaged in Linux distros (this is what led to [1] btw) and used by web engines (especially WebKit and Gecko). Things seem to move very slowly on brotli/woff2 side but my plan is to continue pushing for that when brotli 1.0.0 is released [2].

    Frédéric Wang

    [1] https://github.com/google/brotli/issues/551
    [2] https://github.com/google/brotli/issues/483

    ReplyDelete
    Replies
    1. Thanks! For now I'm just going to bundle brotli (0.6.0) with rr.

      Delete
  2. Any particular reason why you did not test lzma (in practice xz)? It is available on every platform and patent free. Is it just too slow?

    ReplyDelete
    Replies
    1. I did benchmark lzma2/xz, but they are far too slow. Every single compression level took more than 20s (i.e. failed to appear on that chart).

      Delete
    2. As you'd expect, xz compression is better. A couple of data points:
      At 50s, xz shaves 20% off the current (zlib) size.
      At 181s, xz shaves 24% off the current (zlib) size.

      But at the 50s mark you're spending 6x the CPU time of brotli-5 to shave off 4% of its trace size. That feels like a poor return.

      But if we ever want to do something like recompress traces offline for long-term storage, we could use xz.

      Delete
  3. Not that even if you had chosen zstd, it only recently stabilized its stream and (I think) API, and Ubuntu 16.04 released with an old version...

    ReplyDelete
  4. Just out of curiosity, was the zstd test using the multithreaded library? And does the dictionary system have any effect on compression size? (I expect not, but…)

    ReplyDelete
    Replies
    1. I did not experiment with dictionaries. rr compresses in 1MB blocks so I don't expect dictionaries to have much effect.

      I did not use zstd's multithreaded capabilities, because rr already chunks the data into 1MB blocks and compresses them on independent threads.

      Delete