Tuesday, 2 November 2010

Implementing A High-Performance Emulator In Javascript Using Run-Time Code Generation

For a while I've been thinking about exploiting fast browser JITs and JS "eval()" to build really fast emulators and other language runtimes. Tonight I was feeling jumpy so I went ahead and hacked something up.

I started by defining a really simple imaginary CPU instruction set, writing a simple program for that instruction set, and implementing a trivial interpreter to execute it. It executes 100M instructions in about 7.5 seconds (Firefox opt trunk on my laptop), which is already pretty good! But we can do a lot better.

My idea is to write a trace compiler in Javascript for the emulated CPU. It's actually very easy. When we need to execute CPU instructions starting at a given PC address, we build up a string containing the Javascript code that needs to be executed for each instruction starting at the given address. I terminate the trace after any branch instruction. I then wrap the code string up into a function and call eval() on it to get a real Javascript function I can execute to get the effect of the trace. From there the browser JIT will compile my function into real native machine code, which I can call over and over again, every time control reaches that address. We're effectively translating from the emulated CPU all the way down to bare-metal x86, and in just a couple of hundred lines of JS! (Ok, plus the JS engine :-).) This very hacked-together compiler runs 100M instructions in about 1.2 seconds. That's a 6x speedup over the interpreter, but more importantly it means I'm getting over 80MIPS of performance, in the browser! Insane!!!

The compiler is super-simple and it could probably be improved in many ways to get even better performance. The key optimization I implemented was to deal with the "cycle counter" my imaginary CPU has --- essentially an emulated interrupt timer that counts down once per instruction executed. I generate two versions of each trace --- one with cycle counter checks at each instruction, and another with all the checks removed and a simple guard at the top of the trace to ensure that the cycle counter won't fire before we reach the end of the trace; if the guard fails, we take the slow check-on-every-instruction path. The demo page shows the Javascript code that the compiler emits to do this.

Of course, there isn't anything special about JS here, you could do the same thing with Java or .NET bytecodes, or in other languages with "eval". Still, since there are a lot of browser-based emulators out there, it would be cool to see people try this technique. It's hard to predict how well it would do in a real emulator, but I have hopes. Real CPU instruction sets involve more complex instruction decoding than my toy, and the trace compiler is able to hide that.

Update I thought of a couple more simple optimizations overnight, plus Jesse suggested one, so I implemented them all in about 15 minutes. Developing compilers with an edit/reload cycle is fun!


  • Jesse suggested using "new Function(code)" instead of eval(). Total time now 1.05 seconds.
  • I wrapped the body of each function in "while (true)". Then, wherever the trace sets the new PC to the PC at the start of the trace (i.e., it's a simple loop), we can replace that with "continue;" to actually loop inside our trace function without returning to the trace dispatch loop. Total time now 0.8 seconds.
  • I turned the "memory" and "regs" arrays into JS typed arrays (using feature detection to fall back for older browsers). Total time now 0.65 seconds.

So performance about doubled, now we're at 150MIPS. That's enough for one day!

(For perspective, the maximum clock of my laptop CPU is about 3GHz, so we're executing one emulated instruction every 20 native processor cycles.That's amazing!)

Oh, and just for kicks, totally unscientific benchmarks of other browsers. IE9 beta 1: interpreter 7.3 seconds, trace compiler 2.8 seconds. Chrome 7.0.517.41: interpreter 6.7 seconds, trace compiler 1.6 seconds. Neither of those browsers support typed arrays. I should note that typed arrays sped up the interpreter a lot in Firefox, now about 3.3 seconds.



31 comments:

  1. Or you could try sleeping. But thanks :)

    ReplyDelete
  2. Hey the link is broken :)

    ReplyDelete
  3. Cool stuff!
    The plan is for Emscripten, an LLVM-to-JS compiler, to use the same technique. The compiler is written in JS, and it already takes advantage of that by being able to run the same JS code during compilation and during running. Later on this will include generating JS code at runtime in an optimized way, and then exactly as in your demo here, eval() will let the JS engine JIT turn that into native code.

    ReplyDelete
  4. I'd use "new Function(s)" rather than "eval(s)". eval's scoping rules hurt the JS engine.

    ReplyDelete
  5. Robert O'Callahan2 November 2010 21:23

    Thanks for the tip Jesse!
    Chris: fixed link.

    ReplyDelete
  6. You want (x >>> 0) instead of (x & 0xffffffff), just for source brevity, but it should also win some perf too.
    /be

    ReplyDelete
  7. Robert O'Callahan2 November 2010 22:39

    OK, I just tried >>> 0 over & 0xFFFFFFFF, and it's a significant perf hit for some reason.

    ReplyDelete
  8. Strange... Here I get 1247ms for trace compile but just 4555ms for interpreter — a 3.65x speedup. (1.6GHz Merom)
    Did a lot of retires (stable results) and tested sanity using external clock.

    ReplyDelete
  9. Robert O'Callahan2 November 2010 23:13

    Yeah, typed arrays sped up the interpreter. I'll update the post.

    ReplyDelete
  10. > OK, I just tried >>> 0 over & 0xFFFFFFFF, and it's a significant perf hit for some reason.
    Bug on file? Link for your fans to follow...
    /be

    ReplyDelete
  11. Thanks for demonstrating that this kind of optimization can be done. Now I'll work on fast-tracking some fixes to my js gameboy color emulator, so that I can start work on optimizing the instruction execution.
    Thanks

    ReplyDelete
  12. I bet using typed arrays instead of normal JS arrays effectively worked around bug 608733 and made us trace the main interpreter loop.... Hence the speedup on the interpreter part.

    ReplyDelete
  13. This is very cool, but the working set of the program is minuscule. I've written a few emulators in C and the programs that are being executed can be 8-20K instructions. Of course only a small number of instruction sequences will be in play at a time, but it might be a few hundred. Would the hot-spot cache keep this many routines alive at a time?
    For instance, emulating a processor which itself is running a BASIC interpreter, the loop might be to look up a few variables by name, fetch their contents, perform a math op (say floating point add), store the result, increment a counter, test the counter, and loop. Each piece would require dozens of instruction sequences to be executed.
    If a few hundred sequences overwhelms the js generated code cache, then the speedup would be limited to whatever hotspots could be produced in a much smaller window, eg, the loop where the floating point mantissa is shifted and aligned, or the loop where the search is done to find the variable "I" in the symbol table.

    ReplyDelete
  14. Robert O'Callahan3 November 2010 04:40

    It's certainly true that the JS JIT's code cache could be an issue.

    ReplyDelete
  15. Yeah, though as an example, the GameBoy / GameBoy Color contains somewhere under 512 op codes (Main + CB).

    ReplyDelete
  16. Witold Baryluk3 November 2010 14:57

    I am using exactly same method in my Erlang to JS converter :)
    I'm using new Function(code), where code is some fragment of Erlang bytecode (BEAM) translated into JS, and which do not contain any branches. Quite easy and gives huge performance boost over simple switch over every opcode.
    Could you please also benchmark your code in Opera 10.70 or 11?

    ReplyDelete
  17. Robert O'Callahan3 November 2010 19:46

    Opera 11: the interpreter doesn't seem to work, the compiler gets 3.8 seconds.

    ReplyDelete
  18. ROC: I'm wondering if this method is still good for the js gameboy color emulator. hmm. First though, I'm working on optimizing the interpreter (interpreter core should come before the tracer, from my understanding). Traditional optimizations like compiling the conditional memory banking to anonymous after their types have been detected seems to work for me (20% CPU win).

    ReplyDelete
  19. A year ago or so I wrote something similar, for MIPS: http://codu.org/projects/jsmips/ . I haven't measured the, err, MIPS of my JSMIPS emulator, but it uses the same basic technique of JITting to JavaScript code and eval'ing that. Plus JSMIPS can run a substantial subset of UNIX (the original idea was to run vim in a browser; I can in fact run vim in a browser now, but it's pretty darn slow :( )

    ReplyDelete
  20. Robert O'Callahan4 November 2010 22:18

    Sounds cool!!!

    ReplyDelete
  21. Typed Arrays sounds like an interesting feature to use.
    But how to do feature detection of typed arrays in Javascript to avoid breaking the script with respect to other browser vendors?
    Do you have a snippet of that?

    ReplyDelete
  22. It would be great to use typed arrays to write to the canvas.
    Currently, writing a pixel to the canvas requires 4 array accesses, in order to write each sub-pixel value.
    Writing the 32-bit color in 1 array access via typed array view of the canvas would be a great improvement.

    ReplyDelete
  23. Robert O'Callahan6 November 2010 05:46

    Oliver: It's there in the demo --- "if (ArrayBuffer)"
    lobster: yes!

    ReplyDelete
  24. I filed Bug 609965 because typed array access to 2D Canvas doesn't seem to work on Firefox nightly.

    ReplyDelete
  25. Out of curiosity, why do you stop tracing after every branch, rather than just after conditional branches? It seems like that wouldn't be much harder, and would generate more useful traces (ISTR that's what TraceMonkey did originally, too)

    ReplyDelete
  26. Robert O'Callahan7 November 2010 11:13

    It's a simple way of controlling trace length. If you trace through unconditional branches, then you have to do something to ensure tracing stops when it encounters an infinite loop.

    ReplyDelete
  27. Why not break the trace only on "back" jumps?. In this way you preclude infinite loops from happening (only "forward" motion is allowed in traces), and you'll be able to trace through unconditional jumps.

    ReplyDelete
  28. Robert O'Callahan8 November 2010 11:11

    That's an option, sure. There's a huge range of strategies you could use for trace generation and optimization. I'm not claiming mine's the best, this was just an experiment and demo.

    ReplyDelete
  29. I would love to test your example. But in Google Chrome and Firefox, I get "ReferenceError: ArrayBuffer is not defined". In MSIE 8 and MSIE 7, I get an "Object Error".
    A little hint: Your test would work if you wrote self.ArrayBuffer.

    ReplyDelete
  30. ROC: I'm trying to see how I'd JIT the GameBoy Color CPU op codes, since there's very very tight synchronization between certain LCD status flags and the sub-scan line rendering, which complicates a simple JIT implementation. Anyways, the GameBoy Color emulation is way faster than before, even though it's still interpreter all the way, due to crazy and liberal application of typed arrays (Video code too now). Anyhow, I timed my emulator against other js gameboy emulators, and so far it wins. I'm wondering if JIT would still apply here since the bit-state flags could be branchy for the JIT.

    ReplyDelete
  31. http://www.gtoal.com/sbt/

    ReplyDelete