Wednesday, 5 April 2017

Rust Optimizations That C++ Can't Do (Version 2)

My last post was wrong in the details, so let me show a better example. Rust code:

pub extern fn foo(v: &u64, callback: fn()) -> u64 {
  let mut sum = 0;
  for _ in 0..100 {
    sum += *v;
C++ version here.

The inner loop of the Rust version turns into:

        inc     ebx
        call    r14
        add     r12, r15
        cmp     ebx, 100
        jl      .LBB0_1
sum is in r12 and the evaluation of *v has been hoisted out of the loop with the result in r15. In the compiled C++ code the value has not been hoisted:
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rbx, qword ptr [r15]
        call    r14
        dec     ebp
        jne     .LBB0_1

Of course the performance difference in this case would be negligible, but if *v was replaced with a more complicated expression, the difference would increase. (No, I don't know why the Rust code counts up instead of counting down like C++ to get slightly better code; someone should fix that.)

Here it's really clear that the semantics of Rust are making this optimization possible. In C++ v could be a reference to a global variable which is modified by the callback function, in which case hoisting the load would be incorrect.

Several people pointed out that with LTO/WPO a C++ compiler might be able to determine that such cases don't happen, in this case that no function passed into callback directly or indirectly modifies v. That's what I meant in my previous post by "a C++ compiler could do some analysis to show that can't happen" ... but as I continued, "those analyses don't scale". The larger and more complicated your code gets, the more likely it is that static analysis will fail to prove the invariants you care about. Sometimes, due to dynamic linking or run-time code generation, relevant code just isn't available to be analyzed. In other cases the analysis algorithms aren't sophisticated enough to prove what you care about (often because no such algorithm is known), or adequate algorithms exist but they were too expensive to deploy. Perhaps the worst cases for sophisticated global analysis are when the property you care about isn't even true, because you accidentally violated it in some part of the program, but you'll never know, because the only effect of that accident is that performance got a little bit worse somewhere else. (E.g. with this example if you accidentally change one of the callback functions to indirectly modify v, your hypothetical WPO compiler does a slightly worse job and you won't know why.)

What we need are language-level invariants that apply globally and support aggressive optimization using only local analysis. C++ has some, but Rust's are stronger. We also need those invariants to be checked and enforced, not just "it's undefined behavior if you do this"; (safe) Rust wins again on that.


  1. Sorry to say, but your articles (last two) are entirely pointless.

    The C++ version of the Rust code is:
    const long& __restrict v

    and the result is better than Rust.

    1. "restrict" (technically, the vendor "__restrict__" extensions to C++) makes the optimization work for C++ in this case, but as I said at the end, 'we also need those invariants to be checked and enforced, not just "it's undefined behavior if you do this"'. "restrict" fails in that regard.

      Looking at it another way, to make "restrict" give you what Rust has, you'd have to use it everywhere and on struct/class members too --- but that's not idiomatic C++ at all. Not only is "restrict" not in the C++ standard, but also such broad usage would make it very difficult to avoid triggering undefined behavior.

  2. Have a peek at what happens if you take "v" by value instead.

    1. For reference :

      There is no point passign this long by reference. The whoel premises of this post is shaky.

  3. Will rustc carry out the hoisting optimization for more complex types? For such simple types C++ you'd pass by value rather than reference-to-const, and this generates much faster code:

    1. > Will rustc carry out the hoisting optimization for more complex types?

      Yes: (C++: