Saturday 4 June 2016
I'm working on a binary rewriting engine. A core component of such engines is a disassembler to parse machine code and extract necessary information about each instruction: its length, operands, and effects. For x86-64 models this is challenging because they have a vast number of instructions, the list keeps growing, and the semantics and encodings are complicated.
I'm standing on the shoulders of giants by using the DynamoRio standalone disassembler/assembler library. It's wonderful, but any general-purpose library for this must wrestle with a difficult tradeoff between generality and performance/footprint. I want to know for each instruction whether it belongs to one of a few narrow classes (e.g. direct control transfer, indirect control transfer, conditional branch, system call), but I don't otherwise care what the opcode is. I want to know for each instruction whether it reads or writes R8-R11, and I want to know whether it reads or writes a RIP-relative memory location, but I don't otherwise care what the operands are. I want to know whether it reads or writes certain flags, and unlike other DynamoRio clients, I want to distinguish "undefined" flag effects from flag writes. So when building a disassembler to serve multiple clients (which you do want to, because it's expensive to build and maintain), you have to produce a lot of information that any given client isn't going to use. This is a problem because disassembly is a critical component of the overhead for a high-performance binary rewriting engine.
DR mitigates this problem somewhat by giving you options to control how much information is produced and producing some information lazily. But I guess that that doesn't get close to the performance of a disassembler optimized for my particular client. For example, DR contains large tables containing opcodes and operand information for a vast number of SSE/AVX instructions, most of which information I don't care about. An optimized disassembler for my client could use much smaller tables.
So here's a goal: design a disassembler framework that can generate disassemblers highly optimized for specific clients.
Here's how I think this could work:
- Design a rich AST for instructions containing all the information any client wants. A disassembler is a function from a string of bytes plus some state bits (e.g. CPU mode) to an Instruction object in this AST.
- Create a domain-specific purely-functional language to express such functions. Make it look as much like vendor CPU documentation as possible. Make it as easy as possible to add new instructions. (DR is not bad but it's not quite as easy as it could perhaps be).
- Actually write an x86-64 disassembler in this language.
- For each client, write a function from Instruction objects to some client-specific data type containing just the information the client needs. Write this function in a tightly constrained DSL that focuses on simple projections --- dropping certain fields, mapping constants onto smaller sets of values, etc.
- The disassembler you want is then the composition of the base disassembler with the per-client projection. Build an optimizing compiler for these composed functions.
I think this is tractable because the domain is very constrained. The input to the composed function is a string of a relatively small (bounded) number of bits. The x86-64 instruction set is baroque but it does have exploitable structure. Going crazy with solver-powered program synthesis should be very feasible.
Like any new disassembler, you'd want to test by generating output in standard text formats and comparing to existing disassemblers.
For a stretch goal, try generating an assembler from the disassembler description. There are actually two kinds of assembly needed by binary rewriting engines. They need to tweak disassembled instructions (e.g. replace one operand with another) and emit the results. They also need to assemble from scratch new instructions with a limited set of specific opcodes and operands. In the latter case, a function call to assemble an instruction can usually be reduced to a a very short sequence of bit-manipulation and store instructions. For both cases, in principle assembly is just the problem of finding a string of bits which disassembles to an Instruction object satisfying certain constraints. Often multiple encodings are possible, in which case you usually want the shortest instruction but hints might be needed to select the optimal form. My intuition is that generating efficient assemblers from the disassembler DSL should be feasible.
This seems like the sort of problem that researchers might have tackled before. The closest thing I can find is Ramsey and Fernández's SLED language in the New Jersey Machine-Code Toolkit from the mid-90s, which doesn't consider the composition and optimization problem.
Anyway, this seems like a pretty cool project. There's scope for doing cutting-edge program synthesis work, and the results would be incredibly useful. Unfortunately I don't have time to do it :-).