For a while I've been thinking about exploiting fast browser JITs and JS "eval()" to build really fast emulators and other language runtimes. Tonight I was feeling jumpy so I went ahead and hacked something up.
I started by defining a really simple imaginary CPU instruction set, writing a simple program for that instruction set, and implementing a trivial interpreter to execute it. It executes 100M instructions in about 7.5 seconds (Firefox opt trunk on my laptop), which is already pretty good! But we can do a lot better.
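The post doesn't include the instruction set itself, so here's a minimal sketch of what that kind of interpreter looks like; the opcodes, operand fields, and register/memory sizes are all made up for illustration.

```js
// Minimal sketch of a toy-ISA interpreter; the opcodes, operand fields and
// sizes here are hypothetical, not the actual demo's.
var OP_ADDI = 0, OP_ADD = 1, OP_BNEZ = 2;   // made-up opcodes
var regs = [0, 0, 0, 0];                    // emulated registers
var memory = new Array(65536);              // flat emulated memory
var pc = 0;                                 // emulated program counter
var cycleCounter = 10000;                   // emulated interrupt timer

function handleInterrupt() {
  // Service the emulated timer interrupt, then reset the countdown.
  cycleCounter = 10000;
}

function interpret(program, insnCount) {
  for (var i = 0; i < insnCount; i++) {
    var insn = program[pc];
    switch (insn.op) {
    case OP_ADDI: regs[insn.a] = (regs[insn.a] + insn.imm) | 0;     pc++; break;
    case OP_ADD:  regs[insn.a] = (regs[insn.b] + regs[insn.c]) | 0; pc++; break;
    case OP_BNEZ: pc = (regs[insn.a] !== 0) ? insn.target : pc + 1;       break;
    }
    if (--cycleCounter === 0) handleInterrupt();
  }
}
```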
My idea is to write a trace compiler in Javascript for the emulated CPU. It's actually very easy. When we need to execute CPU instructions starting at a given PC address, we build up a string containing the Javascript code that needs to be executed for each instruction starting at the given address. I terminate the trace after any branch instruction. I then wrap the code string up into a function and call eval() on it to get a real Javascript function I can execute to get the effect of the trace. From there the browser JIT will compile my function into real native machine code, which I can call over and over again, every time control reaches that address. We're effectively translating from the emulated CPU all the way down to bare-metal x86, and in just a couple of hundred lines of JS! (Ok, plus the JS engine :-).) This very hacked-together compiler runs 100M instructions in about 1.2 seconds. That's a 6x speedup over the interpreter, but more importantly it means I'm getting over 80MIPS of performance, in the browser! Insane!!!
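Here's a rough sketch of the idea, reusing the made-up opcodes and globals from the interpreter sketch above; cycle-counter handling is left out here (that's covered next), and the real compiler's details will differ.

```js
// Rough sketch of the trace compiler: build a string of JS for the
// instructions starting at startPC, stop after the first branch, eval() it
// into a real function, and cache compiled traces by PC.
var traceCache = {};

function compileTrace(program, startPC) {
  var code = "";
  var tracePC = startPC;
  for (;;) {
    var insn = program[tracePC];
    if (insn.op === OP_ADDI) {
      code += "regs[" + insn.a + "] = (regs[" + insn.a + "] + " + insn.imm + ") | 0;\n";
      tracePC++;
    } else if (insn.op === OP_ADD) {
      code += "regs[" + insn.a + "] = (regs[" + insn.b + "] + regs[" + insn.c + "]) | 0;\n";
      tracePC++;
    } else {
      // A branch (OP_BNEZ) terminates the trace; the trace returns the next PC.
      code += "return (regs[" + insn.a + "] !== 0) ? " + insn.target + " : " + (tracePC + 1) + ";\n";
      break;
    }
  }
  // Turn the string into a real function; the browser JIT compiles it to
  // native code once it gets hot.
  return eval("(function () {\n" + code + "})");
}

function runCompiled(program, startPC) {
  var nextPC = startPC;
  for (;;) {   // halt/termination handling omitted for brevity
    var trace = traceCache[nextPC] ||
                (traceCache[nextPC] = compileTrace(program, nextPC));
    nextPC = trace();   // each trace returns the PC it branched or fell through to
  }
}
```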
The compiler is super-simple and it could probably be improved in many ways to get even better performance. The key optimization I implemented was to deal with the "cycle counter" my imaginary CPU has --- essentially an emulated interrupt timer that counts down once per instruction executed. I generate two versions of each trace --- one with cycle counter checks at each instruction, and another with all the checks removed and a simple guard at the top of the trace to ensure that the cycle counter won't fire before we reach the end of the trace; if the guard fails, we take the slow check-on-every-instruction path. The demo page shows the Javascript code that the compiler emits to do this.
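For a hypothetical three-instruction loop starting at PC 100, the emitted pair of traces might look roughly like this (illustrative only; the demo page shows the compiler's real output):

```js
function trace_100_fast() {
  // Guard: this trace executes 3 instructions, so make sure the timer
  // can't fire before the end; otherwise take the fully-checked path.
  if (cycleCounter <= 3) return trace_100_slow();
  cycleCounter -= 3;
  regs[1] = (regs[1] + 1) | 0;
  regs[2] = (regs[2] + regs[1]) | 0;
  return (regs[3] !== 0) ? 100 : 103;   // branch: return the next PC
}

function trace_100_slow() {
  if (--cycleCounter === 0) handleInterrupt();
  regs[1] = (regs[1] + 1) | 0;
  if (--cycleCounter === 0) handleInterrupt();
  regs[2] = (regs[2] + regs[1]) | 0;
  if (--cycleCounter === 0) handleInterrupt();
  return (regs[3] !== 0) ? 100 : 103;
}
```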
Of course, there isn't anything special about JS here; you could do the same thing with Java or .NET bytecodes, or in other languages with "eval". Still, since there are a lot of browser-based emulators out there, it would be cool to see people try this technique. It's hard to predict how well it would do in a real emulator, but I have hopes. Real CPU instruction sets involve more complex instruction decoding than my toy does, and the trace compiler is able to hide that cost, since each instruction only has to be decoded once, when its trace is compiled, rather than every time it executes.
Update: I thought of a couple more simple optimizations overnight, plus Jesse suggested one, so I implemented them all in about 15 minutes. Developing compilers with an edit/reload cycle is fun!
- Jesse suggested using "new Function(code)" instead of eval(). Total time now 1.05 seconds.
- I wrapped the body of each function in "while (true)". Then, wherever the trace sets the new PC to the PC at the start of the trace (i.e., it's a simple loop), we can replace that with "continue;" to actually loop inside our trace function without returning to the trace dispatch loop. Total time now 0.8 seconds.
- I turned the "memory" and "regs" arrays into JS typed arrays (using feature detection to fall back for older browsers). Total time now 0.65 seconds. (A rough sketch of all three tweaks is below.)
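Here's what the three tweaks could look like together, again with made-up details (trace bodies, array sizes, PCs):

```js
// Typed arrays where the browser has them, plain arrays otherwise.
var haveTypedArrays = (typeof Int32Array !== "undefined");
var regs   = haveTypedArrays ? new Int32Array(16)    : new Array(16);
var memory = haveTypedArrays ? new Int32Array(65536) : new Array(65536);

// Build traces with new Function() instead of eval(), and wrap the body in
// "while (true)" so a branch back to the start of the trace becomes
// "continue" instead of a return to the dispatch loop. new Function()
// doesn't see enclosing local variables, so regs is passed in explicitly.
function makeTraceFunction(body) {
  return new Function("regs", "while (true) {\n" + body + "}");
}

// Hypothetical emitted body for a trace that branches back to its own start:
var body =
  "regs[1] = (regs[1] + 1) | 0;\n" +
  "if (regs[1] < 1000) continue;\n" +   // branch target == trace start: just loop
  "return 104;\n";                      // otherwise hand the next PC back
var trace = makeTraceFunction(body);
var nextPC = trace(regs);               // spins inside the trace until regs[1] hits 1000
```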
So performance roughly doubled; we're now at 150MIPS. That's enough for one day!
(For perspective, the maximum clock of my laptop CPU is about 3GHz, so we're executing one emulated instruction every 20 native processor cycles. That's amazing!)
Oh, and just for kicks, totally unscientific benchmarks of other browsers. IE9 beta 1: interpreter 7.3 seconds, trace compiler 2.8 seconds. Chrome 7.0.517.41: interpreter 6.7 seconds, trace compiler 1.6 seconds. Neither of those browsers supports typed arrays. I should note that typed arrays sped up the interpreter a lot in Firefox; it's now about 3.3 seconds.