Tuesday 2 November 2010
Implementing A High-Performance Emulator In Javascript Using Run-Time Code Generation
For a while I've been thinking about exploiting fast browser JITs and JS "eval()" to build really fast emulators and other language runtimes. Tonight I was feeling jumpy so I went ahead and hacked something up.
I started by defining a really simple imaginary CPU instruction set, writing a simple program for that instruction set, and implementing a trivial interpreter to execute it. It executes 100M instructions in about 7.5 seconds (Firefox opt trunk on my laptop), which is already pretty good! But we can do a lot better.
My idea is to write a trace compiler in Javascript for the emulated CPU. It's actually very easy. When we need to execute CPU instructions starting at a given PC address, we build up a string containing the Javascript code that needs to be executed for each instruction starting at the given address. I terminate the trace after any branch instruction. I then wrap the code string up into a function and call eval() on it to get a real Javascript function I can execute to get the effect of the trace. From there the browser JIT will compile my function into real native machine code, which I can call over and over again, every time control reaches that address. We're effectively translating from the emulated CPU all the way down to bare-metal x86, and in just a couple of hundred lines of JS! (Ok, plus the JS engine :-).) This very hacked-together compiler runs 100M instructions in about 1.2 seconds. That's a 6x speedup over the interpreter, but more importantly it means I'm getting over 80MIPS of performance, in the browser! Insane!!!
The compiler is super-simple and it could probably be improved in many ways to get even better performance. The key optimization I implemented was to deal with the "cycle counter" my imaginary CPU has --- essentially an emulated interrupt timer that counts down once per instruction executed. I generate two versions of each trace --- one with cycle counter checks at each instruction, and another with all the checks removed and a simple guard at the top of the trace to ensure that the cycle counter won't fire before we reach the end of the trace; if the guard fails, we take the slow check-on-every-instruction path. The demo page shows the Javascript code that the compiler emits to do this.
Of course, there isn't anything special about JS here, you could do the same thing with Java or .NET bytecodes, or in other languages with "eval". Still, since there are a lot of browser-based emulators out there, it would be cool to see people try this technique. It's hard to predict how well it would do in a real emulator, but I have hopes. Real CPU instruction sets involve more complex instruction decoding than my toy, and the trace compiler is able to hide that.
Update I thought of a couple more simple optimizations overnight, plus Jesse suggested one, so I implemented them all in about 15 minutes. Developing compilers with an edit/reload cycle is fun!
- Jesse suggested using "new Function(code)" instead of eval(). Total time now 1.05 seconds.
- I wrapped the body of each function in "while (true)". Then, wherever the trace sets the new PC to the PC at the start of the trace (i.e., it's a simple loop), we can replace that with "continue;" to actually loop inside our trace function without returning to the trace dispatch loop. Total time now 0.8 seconds.
- I turned the "memory" and "regs" arrays into JS typed arrays (using feature detection to fall back for older browsers). Total time now 0.65 seconds.
So performance about doubled, now we're at 150MIPS. That's enough for one day!
(For perspective, the maximum clock of my laptop CPU is about 3GHz, so we're executing one emulated instruction every 20 native processor cycles.That's amazing!)
Oh, and just for kicks, totally unscientific benchmarks of other browsers. IE9 beta 1: interpreter 7.3 seconds, trace compiler 2.8 seconds. Chrome 7.0.517.41: interpreter 6.7 seconds, trace compiler 1.6 seconds. Neither of those browsers support typed arrays. I should note that typed arrays sped up the interpreter a lot in Firefox, now about 3.3 seconds.
Comments
The plan is for Emscripten, an LLVM-to-JS compiler, to use the same technique. The compiler is written in JS, and it already takes advantage of that by being able to run the same JS code during compilation and during running. Later on this will include generating JS code at runtime in an optimized way, and then exactly as in your demo here, eval() will let the JS engine JIT turn that into native code.
Chris: fixed link.
/be
Did a lot of retires (stable results) and tested sanity using external clock.
Bug on file? Link for your fans to follow...
/be
Thanks
For instance, emulating a processor which itself is running a BASIC interpreter, the loop might be to look up a few variables by name, fetch their contents, perform a math op (say floating point add), store the result, increment a counter, test the counter, and loop. Each piece would require dozens of instruction sequences to be executed.
If a few hundred sequences overwhelms the js generated code cache, then the speedup would be limited to whatever hotspots could be produced in a much smaller window, eg, the loop where the floating point mantissa is shifted and aligned, or the loop where the search is done to find the variable "I" in the symbol table.
I'm using new Function(code), where code is some fragment of Erlang bytecode (BEAM) translated into JS, and which do not contain any branches. Quite easy and gives huge performance boost over simple switch over every opcode.
Could you please also benchmark your code in Opera 10.70 or 11?
But how to do feature detection of typed arrays in Javascript to avoid breaking the script with respect to other browser vendors?
Do you have a snippet of that?
Currently, writing a pixel to the canvas requires 4 array accesses, in order to write each sub-pixel value.
Writing the 32-bit color in 1 array access via typed array view of the canvas would be a great improvement.
lobster: yes!
A little hint: Your test would work if you wrote self.ArrayBuffer.