Thursday, 11 April 2019

Mysteriously Low Hanging Fruit: A Big Improvement To LLD For Rust Debug Builds

LLD is generally much faster than the GNU ld.bfd and ld.gold linkers, so you would think it has been pretty well optimised. You might then be surprised to discover that a 36-line patch dramatically speeds up linking of Rust debug builds and also greatly shrinks the generated binaries, in both simple examples and large real-world projects.

The basic issue is that the modern approach to eliminating unused functions from linked libraries, --gc-sections, generally cannot remove the DWARF debug info associated with the eliminated functions. With --gc-sections the compiler puts each function in its own independently linkable ELF section, and the linker is then responsible for selecting only the "reachable" functions to be linked into the final executable and discarding the rest. However, compilers still put the DWARF debug info into a single section per compilation unit, and linkers mostly treat debug info sections as indivisible black boxes, so those sections get copied into the final executable even when the functions they describe have been discarded.

My patch tackles the simplest case: when a compilation unit has had all its functions and data discarded, discard the debug info sections for that unit too. Debug info could be shrunk a lot more if the linker could rewrite the DWARF sections to discard the info for just a subset of the functions in a compilation unit, but that would be a lot more work to implement (and would potentially involve performance tradeoffs). Even so, the results of my patch are good: for Pernosco, our "dist" binaries with debug info shrink from 2.9GB to 2.0GB.
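The rule is easy to state in code. Here's a minimal Rust sketch of the patch's logic (the actual patch is C++ inside LLD; the types and names here are hypothetical):

struct Section {
    is_debug: bool,
    live: bool, // survived --gc-sections reachability marking
}

struct ObjectFile {
    sections: Vec<Section>,
}

// If no code or data section in this compilation unit is live, its DWARF
// sections describe nothing that remains in the output, so drop them too.
fn drop_dead_debug_info(obj: &mut ObjectFile) {
    let any_code_or_data_live = obj.sections.iter().any(|s| !s.is_debug && s.live);
    if !any_code_or_data_live {
        for s in obj.sections.iter_mut().filter(|s| s.is_debug) {
            s.live = false;
        }
    }
}

fn main() {
    let mut obj = ObjectFile {
        sections: vec![
            Section { is_debug: false, live: false }, // e.g. a discarded .text.foo
            Section { is_debug: true, live: true },   // e.g. .debug_info
        ],
    };
    drop_dead_debug_info(&mut obj);
    assert!(!obj.sections[1].live); // the debug info gets dropped too
}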

Not only was the patch small, it was also pretty easy to implement. I went from never having looked at LLD to working code in an afternoon. So an interesting question is, why wasn't this done years ago? I can think of a few contributing reasons:

People just expect binaries with debug info to be bloated, and since those binaries are only used for debugging, hardly anyone (apart from a few people working on Linux distros) thinks it's worth spending much effort trying to shrink them.

C/C++ libraries that expect to be statically linked, especially common ones like glibc, don't rely on --gc-sections to discard unused functions. Instead, they split the library into many small compilation units, ideally one per independently usable function. This is extra work for library developers, but it solves the debug info problem. Rust developers don't (and really, can't) do this because rustc splits crates into compilation units in a way that isn't under the control of the developer. Less work for developers is good, so I don't think Rust should change this; tools need to keep up.

Big companies that contribute to LLD, with big projects that statically link third-party libraries, often "vendor" those libraries, copying the library source code into their big project and building it as part of that project. As part of that process, they would usually tweak the library to only build the parts their project uses, avoiding the problem.

There has been tension in the LLD community between doing the simple thing I did and doing something more difficult and complex involving DWARF rewriting, which would have greater returns. Perhaps my patch submission to some extent forced the issue.

Friday, 5 April 2019

Rust Discussion At IFIP WG2.4

I've spent this week at an IFIP WG2.4 meeting, where researchers share ideas and discuss topics in programming languages, analysis and software systems. The meeting has been in Paihia in the Bay of Islands, so very conveniently located for me. My main talk was about Pernosco, but I also took the opportunity to introduce people to Rust and the very significant advances in programming language technology that it delivers. My slides are rudimentary because I wanted to minimize my talking and leave plenty of time for questions and discussion. I think it went pretty well.

The main point I wanted researchers to internalize is that Rust provides a lot of structure that could potentially be exploited by static analysis and other kinds of tools, and that we should expect future systems programming languages to at least meet the bar set by Rust, so forward-looking research should try to exploit these properties. I think Rust's tight control of aliasing is especially important because aliasing is still such a problematic issue for all kinds of static analysis techniques. The audience seemed receptive.
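As a tiny illustration (mine, not from the talk) of why that aliasing control matters: a &mut reference is guaranteed unique, so an analysis can rely on non-aliasing facts that a C analysis can only hope for.

// Because `a` is a unique &mut reference, `a` and `b` cannot alias, so an
// analysis (or the optimizer) may assume *b is unchanged by the write to *a.
fn add_assign(a: &mut i32, b: &i32) {
    *a += *b;
}

fn main() {
    let mut x = 1;
    let y = 2;
    add_assign(&mut x, &y);
    // add_assign(&mut x, &x); // rejected: `x` can't be borrowed as immutable
    //                         // while it is also borrowed as mutable
    println!("{}", x);
}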

One person asked me whether they should be teaching Rust instead of C for their "systems programming" courses. I definitely think so. I wouldn't teach Rust as a first programming language, but for a more advanced course focusing on systems programming I think Rust would be a great way to force people to think about issues such as lifetimes — issues that C programmers should grapple with, but whose sloppy handling they can often get away with in classroom exercises.
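For a concrete classroom-sized example (my sketch, not from the discussion): the classic C mistake of returning a pointer to a stack local usually compiles with at most a warning, while rustc rejects the Rust equivalent outright. The compile error is the point here:

// Returning a reference to a local: the C analogue compiles and dangles,
// but this does not compile at all.
fn dangling() -> &'static i32 {
    let x = 42;
    &x // error[E0515]: cannot return reference to local variable `x`
}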

Saturday, 30 March 2019

Marama Davidson And The Truth About Auckland's History

On March 24 I sent the following email to Marama Davidson's parliamentary office email address.

Subject: Question about Ms Davidson's speech at the Auckland peace rally on March 16

I was at the rally. During her speech Ms Davidson mentioned that the very land we were standing on (Aotea Square) was taken from Māori by European settlers by force. However, Wikipedia says:

"By 1840 Te Kawau had become the paramount chief of Ngāti Whātua. Cautious of reprisals from the Ngāpuhi defeated at Matakitaki, Te Kawau found it most convenient to offer Governor Hobson land around the present central city."
https://en.wikipedia.org/wiki/History_of_Auckland

Can you clarify Ms Davidson's statement and/or provide a source for her version?

Sincerely,
Robert O'Callahan

I haven't received a response. Te Ara agrees with Wikipedia.

I'd genuinely like to know the truth here. It would be disappointing if Davidson lied — blithely accepting "all politicians lie" is part of the path to electing people like Donald Trump. On the other hand if the official histories are wrong, that would also be disappointing and they need to be corrected.

Monday, 18 February 2019

Banning Huawei Is The Right Decision

If China's dictator-for-life Xi Jinping orders Huawei to support Chinese government spying, it's impossible to imagine Huawei resisting. The Chinese government flaunts its ability to detain anyone at any time for any reason.

The argument "no-one has caught Huawei doing anything wrong" (other than stealing technology) misses the point; the concern is about what they might do in the future.

The idea that you can buy equipment from Huawei today and protect it from future hijacking doesn't work. It will need to be maintained and upgraded by Huawei, which will let them add backdoors in the future even if there aren't any (accidental or deliberate) today.

Don't imagine you can inspect their systems to find backdoors. Skilled engineers can insert practically undetectable backdoors at many different levels of a computer system.

These same issues apply to other Chinese technology companies.

These same issues apply to technology companies from other countries, but New Zealand should worry less about technology companies from Western powers. Almost every developed country has much greater rule of law than China has; for example US spy agencies can force tech companies to cooperate using National Security Letters, but those can be challenged in court. We also have to weigh how much we fear the influence of different governments. I think New Zealand should worry a lot less about historically friendly democracies, flawed as they are, than about a ruthless tyranny like the Chinese government with a history of offensive cyberwarfare.

New Zealand and other countries may pay an economic price for such decisions, and I can see scenarios where the Chinese government decides to make an example of us to try to frighten other nations into line. Hopefully that won't happen and we won't be forced to choose between friendship with China and digital sovereignty — but if we have to pick one, we'd better pick digital sovereignty.

It would be easier for Western countries to take the right stand if the US President didn't fawn over dictators, spit on traditional US allies, and impose tariffs on us for no good reason.

Monday, 11 February 2019

Rust's Affine Types Catch An Interesting Bug

A function synchronously downloads a resource from Amazon S3 using a single GetObject request. I want it to automatically retry the download if there's a network error. A wrapper function aws_retry_sync based on futures-retry takes a closure and automatically reruns it if necessary, so the new code looks like this:

pub fn s3_download<W: Write>(
    client: S3Client,
    bucket: String,
    key: String,
    out: W,
) -> io::Result<()> {
    aws_retry_sync(move || {
        let response = client.get_object(...).sync()?;
        if let Some(body) = response.body {
            body.fold(out, |mut out, bytes: Vec<u8>| -> io::Result<W> {
                out.write_all(&bytes)?;
                Ok(out)
            })
            .wait()?;
        }
        Ok(())
    })
}
This fails to compile for an excellent reason:
error[E0507]: cannot move out of captured variable in an `FnMut` closure
   --> aws-utils/src/lib.rs:194:23
    |
185 |     out: W,
    |     --- captured outer variable
...
194 |             body.fold(out, |mut out, bytes: Vec<u8>| -> io::Result<W> {
    |                       ^^^ cannot move out of captured variable in an `FnMut` closure
I.e., the closure can execute more than once, but each time it executes it wants to take ownership of out. Imagine if this compiled: if the closure ran once and wrote N bytes to out, then the network connection failed and we retried successfully, we would write those N bytes to out again, followed by the rest of the data. This would be a subtle and hard-to-reproduce error.

A retry closure should not have side effects for failed operations, and therefore should not take ownership of out at all. Instead it should accumulate data into a buffer, which we write to out if and only if the entire fetch succeeds. (For large S3 downloads you need parallel downloads of separate ranges, so that network errors only require refetching part of the object; that approach deserves a separate implementation.)
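Here's a minimal sketch of that buffering approach (retry_sync and fetch_object are hypothetical stand-ins for aws_retry_sync and the S3 request):

use std::io::{self, Write};

// Hypothetical stand-in for aws_retry_sync: rerun the closure a few times.
fn retry_sync<T, F: FnMut() -> io::Result<T>>(mut f: F) -> io::Result<T> {
    let mut last_err = None;
    for _ in 0..3 {
        match f() {
            Ok(v) => return Ok(v),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.unwrap())
}

// Hypothetical stand-in for the S3 fetch.
fn fetch_object(buf: &mut Vec<u8>) -> io::Result<()> {
    buf.extend_from_slice(b"object bytes");
    Ok(())
}

// The closure only touches its own fresh buffer, so a failed attempt has no
// side effects; `out` is written exactly once, after the whole fetch succeeds.
pub fn s3_download_buffered<W: Write>(mut out: W) -> io::Result<()> {
    let buf = retry_sync(|| {
        let mut buf = Vec::new();
        fetch_object(&mut buf)?;
        Ok(buf)
    })?;
    out.write_all(&buf)
}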

Ownership types are for more than just memory and thread safety.

Mt Taranaki 2019

Last weekend I climbed Mt Taranaki again. Last time it was just me and my kids, but this time I had a larger group of ten people — one of my kids and a number of friends from church and elsewhere. We had a range of ages and fitness levels, but everyone else was younger than me and we had plans in place in case anyone needed to turn back.

We went this weekend because the weather forecast was excellent. We tried to start the walk at dawn on Saturday but were delayed because the North Egmont Visitor's Centre carpark apparently filled up at 4:30am; everyone arriving after that had to park at the nearest cafe and catch a shuttle to the visitor's centre, so we didn't start until 7:40am.

In short: we had a long hard day, as expected, but everyone made it to the crater, most of us by 12:30pm. Most of our group clambered up to the very summit, and we all made it back safely. Unfortunately clouds set in around the top not long before we got there, so there wasn't much of a view, but we had good views much of the rest of the time. You could clearly see Ruapehu, Ngauruhoe and Tongariro to the east, 180km away. It was a really great day. The last of our group got back to the visitor's centre around 6pm.

My kid is six years older than last time and much more experienced at tramping, so this time he was actually the fastest of our entire group. I'm proud of him. I think I found it harder than last time — probably just age. As I got near the summit my knees started to twinge and cramp if I wasn't careful on the big steps up. I was also a bit shorter of breath than I remember from last time. I was faster at going down the scree slope though, definitely the trickiest part of the descent.

On the drive back from New Plymouth yesterday, the part of the group in our car stopped at the "Three Sisters", rock formations on the beach near Highway 3 along the coast. I had just spotted it on the map and we didn't know what was there, but it turned out to be brilliant. We had a relaxing walk, and the beach, surf, rocks and sea-caves were beautiful. Highly recommended — but you need to be there around low tide to walk along the riverbank to the beach and through the caves.

Sunday, 27 January 2019

Experimental Data On Reproducing Intermittent MongoDB Test Failures With rr Chaos Mode

Max Hirschhorn from MongoDB has released some very interesting results from an experiment reproducing intermittent MongoDB test failures using rr chaos mode.

He collected 18 intermittent test failure issues and tried running each of them 1000 times under the plain test harness, under rr, and under rr with chaos mode. He noted that for 13 of these failures, MongoDB developers were able to make them reproducible on demand through manual study of the failure and trial-and-error insertion of "sleep" calls at relevant points in the code.

Unfortunately rr didn't reproduce any of the 5 failures that were never reproduced manually. However, it did reproduce 9 of the 13 manually reproduced failures. Doing many test runs under rr chaos mode takes a lot less developer effort than the manual method, so it's probably a good idea to try running under rr first.

Of the 9 failures reproducible under rr, 3 also reproduced at least once in 1000 runs without rr (with frequencies 1, 3 and 54 per 1000). Of course, with such low reproduction rates those failures would still be pretty hard to debug with a regular debugger or logging.

The data also shows that rr chaos mode is really effective: in almost all cases where he measured chaos mode vs rr non-chaos or running without rr, rr chaos mode dramatically increased the failure reproduction rate.

The data has some gaps but I think it's particularly valuable because it's been gathered on real-world test failures on an important real-world system, in an application domain where I think rr hasn't been used before. Max has no reason to favour rr, and I had no interaction with him between the start of the experiment and the end. As far as I know there's been no tweaking of rr and no cherry-picking of test cases.

I plan to look into the failures that rr was unable to reproduce to see if we can improve chaos mode to catch them and others like them in the future. He hit at least one rr bug as well.

I've collated the data for easier analysis here:

Failure   | Reproduced manually | rr-chaos reproductions | regular rr reproductions | no-rr reproductions
BF-9810   | --                  | 0/1000                 | ?                        | ?
BF-9958   | Yes                 | 71/1000                | 2/1000                   | 0/1000
BF-10932  | Yes                 | 191/1000               | 0/1000                   | 0/1000
BF-10742  | Yes                 | 97/1000                | 0/1000                   | 0/1000
BF-6346   | Yes                 | 0/1000                 | 0/1000                   | 0/1000
BF-8424   | Yes                 | 1/232                  | 1/973                    | 0/1000
BF-7114   | Yes                 | 0/48                   | ?                        | ?
BF-7588   | Yes                 | 193/1000               | 96/1000                  | 54/1000
BF-7888   | Yes                 | 0/1000                 | ?                        | ?
BF-8258   | --                  | 0/636                  | ?                        | ?
BF-8642   | Yes                 | 3/1000                 | ?                        | 0/1000
BF-9248   | Yes                 | 0/1000                 | ?                        | ?
BF-9426   | --                  | 0/1000                 | ?                        | ?
BF-9552   | Yes                 | 5/563                  | ?                        | ?
BF-9864   | --                  | 0/687                  | ?                        | ?
BF-10729  | Yes                 | 2/1000                 | ?                        | 1/1000
BF-11054  | Yes                 | 7/1000                 | ?                        | 3/1000