Friday, 18 February 2005

Optimizing GNU ld


Optimizing the compile/edit/debug cycle for speed is very important, especially when you do a lot of debugging by inserting print statements, as I tend to do. Working on Mozilla's layout engine means that almost everything I do requires relinking of libgklayout.so, which is a mammoth shared library. With debug information, it's about 93MB in my current builds. Link times of over a minute are normal on lower-end machines. Apart from the basic delay, I hate to just stare at the screen during this time so I often go off web surfing and lose my engagement with what I'm working on. This is a real problem so I thought it would be worthwhile to invest some time in making gklayout linking faster.

The first things I did were to build ld from the latest CVS sources and construct a benchmark script that would link gklayout in a repeatable way and time how long it took. Actually I run it a few times and take the last number, so that all files are in memory and no I/O is required, since that's how it generally is when I'm in a debugging session. The first thing I found was a real shock: the ld I built from source linked gklayout in 15 seconds, whereas the system linker on my machine took 65 seconds. I don't really understand why but I've heard that certain recent ld versions had bugs that made it perform poorly on this workload, which showed up in Fedora Core 3 and were fixed by Red Hat in the very latest version. Anyway, the good news is that I was looking at a 4x speedup without having done anything. I could have stopped there but I thought I might as well keep digging for another day or two to see what else could be done.

Next I set up oprofile. Oprofile is great; some of the more esoteric features need to be ported to the new Xeons in my machine, so I can only use oprofile to measure time, but that was enough for this project. I looked where time was being spent in the ld binary and "opreport -l ~/bin/ld --symbols" showed that we were spending about 35% of the time stalled in the function walk_wild_sections. (BTW I'm going to gloss over all the time I spent going down blind alleys and figuring out how ld actually works :-).) Basically the problem is that there's a loop over a list of "wildcard specs"; some of the wildcards are just plain strings, and others have wildcard characters like '*', and they get handled differently. According to "opannotate --source --assembly ~/bin/ld", the conditional branch controlling which kind of string is present seemed to be very poorly predicted so we spent a lot of time there waiting for the pipeline to refill. And this is actually inside a huge loop where we iterate over all sections of all object files in the link, which is where ld seems to spend most of its time.

The first thing I did was to introduce special handling for the case when there's only one wildcard spec in the list, and it's not really wild, just a simple string. In that case we can actually find the relevant section in each file, if there is one, by looking up a hash table instead of iterating through all sections of the file. So not only do we eliminate the mispredicted branch, but we actually speed up things at a higher level. This sped the link up from 15 seconds to 12 seconds, a 20% improvement.

Further analysis showed that we still spent time stalled on the same instruction as before. I realized that the wildcard lists that are used by the default linker configuration have only a small number of different forms: one simple string, one wildcard string, one simple string and one wildcard string, one simple string and two wildcard strings, and two simple strings and two wildcard strings. I also noticed that no two elements of a wildcard list can ever match the same section name, which we can also use to speed things up. So I wrote specialized walk_wild_section routines for each of those five cases. These routines do not need to dynamically test whether a wildcard spec is a simple string or not, that's hard coded into the code. So we eliminate all the conditional branch stalls that were hurting us before. Sadly, that only produced an additional 5% improvement (down to about 11.3 seconds). That's why profiling is so important; real-world performance often confounds your expectations, especially when you're working this close to a complex microarchitecture.

Now it looked like most time was in bfd_hash_lookup, but it's called from many places and I couldn't find who the problematic caller(s) were. (Oprofile can do this on x86 but the code hasn't been ported to x86-64 yet. A project for another day perhaps.) So I did "poor man's profiling" by running in gdb, hitting ctrl-C periodically and looking at the stack backtrace. I was surprised to find myself spending most of the time doing UTF8 character conversion in glibc. Sure enough oprofile confirmed we spent a lot more time in glibc than in ld now. In fact we're doing UTF8 conversion because the glibc wildcard matching routines handle Unicode strings, which is complete overkill for ld section names! Furthermore almost all the wildcard patterns used in the default linker configuration are just a simple string followed by a single '*', which are trivial to match. So I specialized my special-case walk_wild_section functions even further to work just on those patterns, and hand-wrote a small inlineable match function that does no character set conversion (but will still work if section names and wildcard specs happen to be UTF8). The results are astounding: link time is down to 5 seconds, a 3x improvement since I started my work, and a 12x improvement over what I'm used to!

At the beginning I was thinking I might have to do something really clever to parallelize ld across multiple processors, but now that just isn't worth it, not for this workload anyway.

I'll try to get these patches fed upstream so all ld users can take advantage of these optimizations.


7 comments:

  1. Great work! Give us a link when you get the patches in a bug or some such place. Thanks,
    /be

    ReplyDelete
  2. Just curious: How much time did you spent on this? :-)

    ReplyDelete
  3. Robert O'Callahan19 February 2005 03:23

    It took a couple of days.

    ReplyDelete
  4. If you haven't seen them, you should really play with the CHUD tools on Mac OS X. You can sample any process without instrumenting it (which always introduces bias), and the tools let you poke at every level from the library level to the assembly. The Shark tool is most impressive. Check out:

    ReplyDelete
  5. Robert O'Callahan21 February 2005 01:48

    I've heard the OSX tools are great. Now that I have a Mac at home, I'll try them. At work I'll have to stick with oprofile, which is fine because oprofile is pretty good (low-overhead sampling, kernel and user, stack traces, etc).

    ReplyDelete
  6. Awesome work. I've been thinking about ld recently myself, as I have a project that takes 3 or 4 minutes to link on an Athlon-MP 1800. I'm keen to try out this work and see how much a difference it makes.

    ReplyDelete