Tuesday, 1 November 2005

"Premature Optimization Is The Root Of All Evil" Is The Root Of Some Evil

There's a folklore quote, "premature optimization is the root of all evil", attributed to Tony Hoare and Donald Knuth. A variant is due to my PhD advisor's father, Michael Jackson: "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." Unfortunately --- and I'm not the first to note this --- this advice, taken out of context and followed slavishly, often leads people into deep trouble. The problem is that it collides with another well-known fact of software development: it gets more expensive to make changes to a system later in its development, especially cross-cutting changes that impact module interfaces. If you leave optimization until late in development, then profile and find that fixing the performance bottleneck requires a major change to your design, or forces major module interface changes, then you have a serious problem.

Clearly one should therefore design a system with an eye to where the bottlenecks may be, and try to ensure the design has enough flexibility to accommodate the optimizations that will be required. This is dangerous and impossible to get right all the time, but such is life in software development. It is not enough to think only about high-level performance, because sometimes low-level coding issues have a significant impact and may be constrained by the design. For example, in Gecko the designers created module interfaces that rely on virtual method calls almost everywhere. Individual virtual calls are cheap, but used very frequently --- and without the support of advanced JIT compilation techniques --- they become a significant performance drag. In some cases, where actual polymorphism is in use, devirtualizing the calls requires significant and expensive restructuring of the code.

I find it useful to do a sort of "gedankenprofiling". Guess or measure some important workloads, then sketch out mentally how each workload will be processed by the proposed designs, focusing on the apparent bottlenecks. Try to guess how much cost per unit work will be incurred at the bottleneck by alternative designs. Do not choose the design that minimizes the cost; instead, choose the design of minimal complexity that can smoothly extend to the cost-minimal design. Then when you start implementing and measuring and discovering your mistaken assumptions, you have the best chance of still getting to a good place relatively cheaply.

Right now I'm redesigning the way we paint frames in Gecko and thinking a lot about how it will perform in various scenarios, and trying to put in just the right amount of flexibility to handle potential future issues. Fortunately I don't have to guess so much since we already know a lot about where our performance problems are.


10 comments:

  1. Off topic, but it looks so nice in column layout!

    ReplyDelete
  2. Can you define what you mean by a "virtual method call"?

    ReplyDelete
  3. Robert O'Callahan, 2 November 2005 09:21

    A C++ "virtual" method. It compiles to an indirect branch: instructions that load the address of a function from memory and then transfer control to the loaded address. On the other hand, normal "nonvirtual" (aka "static") function calls typically compile to direct branch instructions, where the destination address is part of the instruction. (The situation is often more complicated; for example, normal function calls across shared library boundaries are often implemented using indirect branches.)

    ReplyDelete
  4. ###Fortunately I don't have to guess so much since we already know a lot about where our performance problems are.
    I guess that's the main idea behind not doing premature optimization. Yes, it takes longer to make the change now, but you only change the parts of the project that actually need it; plus, doing this so late in the project means you also have a good idea of which modifications would hurt the other modules.
    Still, I'd say that the whole notion of using C++ from A to Y (Z being JavaScript and XUL) for development is something of a premature optimization in itself.
    There are numerous programming languages which make it easier to prototype and explore techniques, while the use of a low-level language such as C++ forced you to
    * design the XPCom architecture itself
    * write everything in an XPCom-compliant way, including modules, factories, somewhat manual reference-counting, CIDs, manual error-handling, etc., most of which should not be necessary for simple prototypes
    * clutter the code with an impressive number of not-completely-safe dynamic type casts
    ... all this while having the large performance hit of XPCom's omnipresent virtual methods, nsISupportsArray's unsafe & could-be-faster containers, etc.
    I tend to believe that there are lessons to be remembered for, say, Gecko 3. I would personally advocate the use of a high-level language connectable with low-level primitives. Perhaps something as high-level as Python, OCaml or Scheme, perhaps something slightly more industrial such as Felix or D, or perhaps even something on the Mono platform. Am I wrong?
    P.S.:
    Note that I'm an extension developer by hobby, a programming language researcher/designer by job, and a major C++ skeptic by experience, which makes me heavily biased.

    ReplyDelete
  5. >If you leave optimization until late in development, then profile and find that fixing the performance bottleneck requires a major change to your design, or forces major module interface changes, then you have a serious problem.
    Do you think this was the case with Gecko? Are most of its performance problems fixable or too deeply embedded to make it worth the effort?

    ReplyDelete
  6. Robert O'Callahan, 4 November 2005 19:55

    James: Yes, in many places Gecko had problems that were designed in and have proven expensive to fix. Fortunately we have fixed many of them and papered over others. But as Web content evolves and as we add new features, new problems become apparent...
    Yoric: C++ can be a pain, but I doubt another language would be a real win. One problem is that GC everywhere would induce significant pause times, and more real-time GCs carry significant performance penalties. A bigger problem is that Gecko gets embedded in programs written in many languages (Java, Python, Mono/C#, C, C++), and if it had its own complex runtime that could get very messy. Another problem is that we'd want to ensure whatever infrastructure we used was well supported, performed well across a large variety of platforms --- and was going to stay that way --- and was open source. Other than C/C++, maybe only Mono would qualify, and Mono's GC needs serious work.
    In general I don't believe in Brooks' maxim "plan to throw one away; you will, anyhow", or the concomitant strategy of building a throwaway prototype. It might be useful for small projects, but with something like Gecko you need a lot of infrastructure before you can realistically evaluate your prototype. Once you've built all that, you really, really want to reuse at least some of it.

    ReplyDelete
  7. ###C++ can be a pain but I doubt another language would be a real win.
    Note that I'm not actually claiming that any specific language would be a big win. I'm just suspecting it.
    At the very least, I believe that a custom preprocessor, not necessarily very sophisticated but slightly higher-level than cpp's string crunching, would prove very helpful for XPCom usage, with its mostly-but-not-quite-automatic reference counting, its somewhat-automated factories, module creation and component registration, its absence of practical error handling (that's actually my biggest issue with XPCom), etc.
    ###One problem is that GC everywhere would induce significant pause times and more real-time GCs have significant performance penalties.
    On this point, I'd call that urban legend/FUD. I've seen very few analyses of GC performance, but all of them actually came out in favor of GC-ed languages for everything but the simplest cases. See Boehm's page, for instance.
    ###A bigger problem is that Gecko gets embedded in programs written in many languages (Java, Python, Mono/C#, C, C++) and if it had its own complex runtime that could get very messy.
    Well, I would say XPCom is already its own complex runtime. I mean, one would be hard-pressed to tell the difference between the Mozilla platform/XULRunner and an operating system. Am I wrong?
    ###Another problem is that we'd want to ensure whatever infrastructure we used was well supported and perform well across a large variety of platforms
    True. But that's also the claim of just about every single programming platform on earth, including Java, Python, Mono, OCaml, Haskell, Ruby, Common Lisp, Smalltalk, Perl, (D)TAL, Erlang, Oz, LLVM, D, Objective-C, SML/NJ, Occam, and probably a thousand others. I strongly doubt that none of these languages would have been more appropriate than C++. I mean, for instance, most features of XPCom are language primitives in a number of these languages.
    Perhaps they are not as widely available as Mozilla -- I'm not even sure of that -- but I'm not certain that adopting a well-chosen but possibly less-common language, and helping to port it to obscure platforms if needed, would have been more work than adopting C++, designing XPCom, implementing XPCom, porting XPCom to various platforms, implementing assembly-level xptcall, etc.
    ###Once you've built all that, you really really want to reuse at least some of it.
    Fair enough. Still, I tend to believe that a Mozilla written mostly in Python and only progressively migrated to C++ would have been faster to build. Of course, I don't have your experience with Gecko.
    By the way, I also believe that at some point before Gecko 3, most of the Mozilla platform will have to be rewritten in a safer, more robust and more concurrent way: safer as in better isolation of extension errors, better thread-safety, more fine-grained handling of privileges, possibly with an internal thread manager, etc. Do you think this analysis is correct? If so, will there still be a difference between Mozilla and a full OS+GUI?
    P.S.:
    I may sound harsh, but I think you're all doing a great job. I just want to share my fears and suggestions.

    ReplyDelete
  8. It's only a myth that virtual method calls are costly. Almost all CPUs have direct machine instructions to do them. The additional cost is then reduced to ~2 CPU cycles per call.
    That's nothing compared to what they enable: OO-Polymorphism.
    In fact, *not* using them is a good example of why premature optimization is evil: optimizing them away can be done at runtime with a JIT compiler (such as Java Hotspot). But adding them back requires a developer to work through the entire code base (regardless of the language) and will break API compatibility, since the caller must know how to call a method/function/procedure.

    ReplyDelete
  9. Robert O'Callahan, 14 June 2007 19:46

    Sigh. Virtual method calls are costly because if the CPU fails to predict the target address --- which can easily happen a lot if you've got a lot of code, so the BTB overflows --- then the entire pipeline will be flushed. That's 20 cycles or more on many processors.
    Sure, a JIT would help a lot here, but we don't have a C++ JIT.

    ReplyDelete
  10. anonymous coward, 6 September 2007 09:57

     Even though every virtual method will have a non-inlined copy of itself (if only to serve as the runtime reference in the vtable), the USE of a virtual call MAY (at times) be eliminated completely:

         struct A
         {
           virtual void go () {}
         };
         struct B : A // must derive from A, or go() is not virtual and go(&b) below won't compile
         {
           void go () {} // implicitly virtual
         };
         void go (A *a)
         {
           a->go ();
         }
         int main ()
         {
           B b;
           go (&b); // the compiler has a chance to inline the whole lot and get rid of the virtual call (a->go())
           return 0;
         }
     True, this may not always be the case, since inlining is limited by:
     - the fact that inlining everything all the time will hurt performance (CPU caching issues of data size vs. code size)
     - 'inlined' method definitions being in inaccessible compilation units;
     ...but then, as per your quote, "Individual virtual calls are very cheap but used very frequently": the more often one uses virtual calls, the more opportunities for the compiler to optimise a virtual call into a non-virtual one. In other words, fix the problem where it is caused, at the compiler optimization stage. Define your code in a header/template-like style; writing 'inline' does not do any inlining, it only gives the compiler an opportunity to do so if it chooses. Then give your compiler options to control how much inlining it does during the build (e.g. for gcc: -finline-functions, -finline-limit=n, --param max-inline-insns-auto, --param max-inline-insns-single, -fstrength-reduce, -floop-optimize, --param max-average-unrolled-insns).

    ReplyDelete