Eyes Above The Waves

Robert O'Callahan. Christian. Repatriate Kiwi. Hacker.

Tuesday 5 June 2007

Shrike

On my last trip to the USA I met up with Steve Fink again who I used to work with at IBM Research. He mentioned that the Shrike bytecode manipulation toolkit that I wrote several years ago is still being used and maintained at IBM and in fact has just be released (again!) as open source as part of the T.J. Watson Libraries for Analysis (WALA). This is cool...

I wrote Shrike because I was doing a lot of bytecode instrumentation of large projects and I was dissatisfied with the convenience and performance of available bytecode manipulation libraries; at the time, BCEL was the main contender, although IBM had an internal library called JikesBT. I'd written a bytecode reader for my thesis work at CMU and I thought I had some ideas that would let me write a library that was smaller, faster, and much more convenient to use than the existing options. In my humble opinion, I succeeded :-).

Shrike was based on a number of ideas:


  • BCEL parses .class file into one large mutable data structure representing a class and its methods. You can then modify the structure, e.g. by inserting instructions into a method, and then BCEL will compile that structure back into a .class file. This is slow, because a lot of data has to be parsed and then recompiled even if it's not modified. It's also a clumsy interface to use, because inserting instructions into a method requires manual adjustment of things like branch targets, exception handler ranges and the like. It's *especially* clumsy when you want to compose multiple independent instrumentation passes applied to the same original code --- you run head on into the problem of one pass instrumenting the code belonging to previous passes. Shrike avoids these problems by offering a patch-based API for code instrumentation. This means you specify your desired changes as fragments of code that should be inserted at specific points. You can add a number of patches, but none of the underlying code is changed until you request those patches be committed. Then Shrike makes one pass over the code, applying all patches and updating all branch targets, exception handlers etc for you. It's very convenient and very fast.
  • Similarly, Shrike doesn't actually give you any mutable object representing a class. There are class reader objects, which give you access to immutable class data, and class writer objects, which essentially are sinks you can use to compile a class. There are utilities for copying a class from a reader to a writer, of course with the option to modify things along the way. Methods and other class data which is not being modified can be copied directly without being parsed or recompiled! This is a huge performance win.
  • Shrike's code representation is better than BCEL's. BCEL uses a linked list of instruction objects; instruction objects are mutable and therefore created fresh for each instruction of each method. In Shrike instruction objects are immutable and the code for a method is simply an array of instruction objects. Inserting or deleting from an array would be slow but thanks to the patch API we don't have to do that. Because instruction objects are immutable, they can be shared within and between methods. This makes bytecode parsing very fast; for the Java bytecode instructions that take no operands, parsing just requires looking up the instruction object in a table and then storing a reference to that object in the code array.
  • Shrike also handles the exception handler ranges in a nice way. In the class file, each exception handler specifies a range of instructions it covers. This is a real pain to manipulate directly if you're moving instructions around or if you want to add or remove handling from a range of instructions. In Shrike there's just an array of exception handler lists, one list for each instruction. These lists are immutable so they can be shared by all instructions which have the same exception handlers covering them. This is very easy to manipulate and it turns out that converting handler ranges to these lists and back again is very efficient.
  • The Java bytecode format has an incredibly nasty feature called JSR --- little subroutines that you could use inside the code for a single method. This was an ill-advised attempt to avoid code explosion in the compilation of "try ... finally" blocks. Several people, including me, scored rather meaningless publications by discussing how to handle this quirk in type systems. Shrike hides it completely by converting all incoming JSR usage to inlined bytecodes; Shrike users never see JSRs.
  • The Java class file format restricts methods to 64K bytecodes. Some of the instrumentation I was doing was blowing this limit in a few methods (primarily static class initializers for certain library classes). Shrike deals with this problem by auto-magically breaking up oversized methods into one method calling helpers. It uses heuristics and may fail, but never actually does. It's a rather esoteric feature but invaluable when you need it...
  • Shrike provides simple dataflow analysis of its code representation. You can use this to figure out the type of any local variable or expression stack location at any instruction, which is very useful for instrumentation. Shrike also provides a bytecode verifier based on that, which is very useful for finding bugs in your instrumentation...

Despite all those features, Shrike is still a pretty small library. The emphasis on using immutable objects and arrays keeps the number of classes small. Using one class for instructions of the same kind (e.g., binary operators) also helps. It's also factored into one component "ShrikeCT" for reading and writing class files, and another component "ShrikeBT" for manipulating method bytecode; you can easily pick and choose components.

Looking back, I'm quite proud of Shrike. It was a real pleasure to do a small, focused project from scratch and get a lot of things right.

BCEL seems to have been eclipsed by ASM for manipulating Java bytecode. ASM seems to use the same "copy reader to writer" paradigm that Shrike does, so it's definitely going in the right direction. It would be interesting to compare ASM with Shrike in more detail, but unfortunately I don't have the time.

Zooming further out, I think it would be a fascinating intellectual and educational exercise to study situations where we have multiple libraries that export basically the same underlying functionality through different APIs --- bytecode instrumentation engines, traditional operating system kernels, traditional GUI toolkits, string libraries, bignum libraries, and so on. It seems likely one could make very concrete and defensible judgements that certain API design decisions were made better in one library than another. This could lead to some reliable and teachable "do's and don'ts" of module interface design, backed by interesting examples, that I think could be incredibly valuable to students of software development. This is important because genuinely teachable facts are heartbreakingly rare in software design. There's a great book here that someone needs to write!



Comments

David Molnar
The Valgrind paper in PLDI this year does a nice job of describing just such tradeoffs for dynamic binary instrumentation. Really well-done piece of work all around...