Monday, 11 February 2019

Rust's Affine Types Catch An Interesting Bug

I have a function that synchronously downloads a resource from Amazon S3 using a single GetObject request. I want it to automatically retry the download if there's a network error. A wrapper function aws_retry_sync, based on futures-retry, takes a closure and automatically reruns it if necessary, so the new code looks like this:

pub fn s3_download<W: Write>(
    client: S3Client,
    bucket: String,
    key: String,
    out: W,
) -> io::Result<()> {
    aws_retry_sync(move || {
        let response = client.get_object(...).sync()?;
        if let Some(body) = response.body {
            body.fold(out, |mut out, bytes: Vec<u8>| -> io::Result<W> {
                out.write_all(&bytes)?;
                Ok(out)
            })
            .wait()?;
        }
        Ok(())
    })
}
This fails to compile for an excellent reason:
error[E0507]: cannot move out of captured variable in an `FnMut` closure
   --> aws-utils/src/lib.rs:194:23
    |
185 |     out: W,
    |     --- captured outer variable
...
194 |             body.fold(out, |mut out, bytes: Vec<u8>| -> io::Result<W> {
    |                       ^^^ cannot move out of captured variable in an `FnMut` closure
I.e., the closure can execute more than once, but each time it executes it wants to take ownership of out. Imagine if this compiled: if the closure ran once and wrote N bytes to out, and then the network connection failed, a successful retry would write those N bytes to out again, followed by the rest of the data. That would be a subtle, hard-to-reproduce bug.
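You can see the same rule in isolation with a tiny made-up example (run_twice and the string here are hypothetical, nothing to do with the S3 code); it fails with exactly the same E0507 error:

fn run_twice<F: FnMut()>(mut f: F) {
    // The closure may be called any number of times ...
    f();
    f();
}

fn main() {
    let s = String::from("hello");
    run_twice(move || {
        // ... so it can't give away its captured `s`:
        // error[E0507]: cannot move out of `s`, a captured variable in an `FnMut` closure
        let owned = s;
        println!("{}", owned);
    });
}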

A retry closure should not have side effects for failed operations and therefore should not take ownership of out at all. Instead it should accumulate the downloaded data in a buffer, which we write to out if and only if the entire fetch succeeds, as sketched below. (For large S3 downloads you need parallel downloads of separate ranges, so that a network error only requires refetching part of the object; that approach deserves a separate implementation.)
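Here's a rough sketch of that buffered approach, assuming aws_retry_sync is generic over the closure's success value so it can hand back the completed buffer (the get_object arguments are elided as above):

pub fn s3_download<W: Write>(
    client: S3Client,
    bucket: String,
    key: String,
    mut out: W,
) -> io::Result<()> {
    // The retry closure only accumulates an in-memory buffer; it never touches `out`.
    let buf = aws_retry_sync(move || {
        let response = client.get_object(...).sync()?;
        match response.body {
            Some(body) => body
                .fold(Vec::new(), |mut buf, bytes: Vec<u8>| -> io::Result<Vec<u8>> {
                    buf.extend_from_slice(&bytes);
                    Ok(buf)
                })
                .wait(),
            None => Ok(Vec::new()),
        }
    })?;
    // Write to `out` only once the entire fetch has succeeded.
    out.write_all(&buf)
}

This trades memory for correctness; for large objects the ranged, parallel approach mentioned above is still the better fit.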

Ownership types are for more than just memory and thread safety.

Mt Taranaki 2019

Last weekend I climbed Mt Taranaki again. Last time was just me and my kids, but this weekend I had a larger group of ten people — one of my kids and a number of friends from church and elsewhere. We had a range of ages and fitness levels but everyone else was younger than me and we had plans in place in case anyone needed to turn back.

We went this weekend because the weather forecast was excellent. We tried to start the walk at dawn on Saturday but were delayed because the North Egmont Visitor's Centre carpark apparently filled up at 4:30am; everyone arriving after that had to park at the nearest cafe and catch a shuttle to the visitor's centre, so we didn't start until 7:40am.

In short: we had a long hard day, as expected, but everyone made it to the crater, most of us by 12:30pm. Most of our group clambered up to the very summit, and we all made it back safely. Unfortunately clouds set in around the top not long before we got there so there wasn't much of a view, but we had good views much of the rest of the time. You could clearly see Ruapehu, Ngauruhoe and Tongariro to the east, 180km away. It was a really great day. The last of our group got back to the visitor's centre around 6pm.

My kid is six years older than last time and much more experienced at tramping, so this time he was actually the fastest of our entire group. I'm proud of him. I think I found it harder than last time — probably just age. As I got near the summit my knees started to twinge and cramp if I wasn't careful on the big steps up. I was also a bit shorter of breath than I remember from last time. I was faster at going down the scree slope though, definitely the trickiest part of the descent.

On the drive back from New Plymouth yesterday, the part of the group in our car stopped at the "Three Sisters", rock formations on the beach near Highway 3 along the coast. I had just seen it on the map and we didn't know what was there, but it turned out to be brilliant. We had a relaxing walk and the beach, surf, rocks and sea-caves were beautiful. Highly recommended — but you need to be there around low tide to walk along the riverbank to the beach and through the caves.

Sunday, 27 January 2019

Experimental Data On Reproducing Intermittent MongoDB Test Failures With rr Chaos Mode

Max Hirschhorn from MongoDB has released some very interesting results from an experiment reproducing intermittent MongoDB test failures using rr chaos mode.

He collected 18 intermittent test failures and tried to reproduce each of them with up to 1000 runs under the plain test harness, under rr, and under rr with chaos mode. He noted that for 13 of these failures, MongoDB developers had been able to make them reproducible on demand by manually studying the failure and inserting "sleep" calls at relevant points in the code by trial and error.

Unfortunately rr didn't reproduce any of his 5 not-manually-reproducible failures. However, it did reproduce 9 of the 13 manually reproduced failures. Doing many test runs under rr chaos mode is a lot less developer effort than the manual method, so it's probably a good idea to try running under rr first.

Of the 9 failures reproducible under rr, 3 also reproduced at least once in 1000 runs without rr (with frequencies 1, 3 and 54). Of course with such low reproduction rates those failures would still be pretty hard to debug with a regular debugger or logging.

The data also shows that rr chaos mode is really effective: in almost all cases where he measured chaos mode vs rr non-chaos or running without rr, rr chaos mode dramatically increased the failure reproduction rate.

The data has some gaps but I think it's particularly valuable because it's been gathered on real-world test failures on an important real-world system, in an application domain where I think rr hasn't been used before. Max has no reason to favour rr, and I had no interaction with him between the start of the experiment and the end. As far as I know there's been no tweaking of rr and no cherry-picking of test cases.

I plan to look into the failures that rr was unable to reproduce to see if we can improve chaos mode to catch them and others like them in the future. He hit at least one rr bug as well.

I've collated the data for easier analysis here:

Failure    Reproduced manually  rr-chaos reproductions  regular rr reproductions  no-rr reproductions
BF-9810    --                   0 / 1000                ?                         ?
BF-9958    Yes                  71 / 1000               2 / 1000                  0 / 1000
BF-10932   Yes                  191 / 1000              0 / 1000                  0 / 1000
BF-10742   Yes                  97 / 1000               0 / 1000                  0 / 1000
BF-6346    Yes                  0 / 1000                0 / 1000                  0 / 1000
BF-8424    Yes                  1 / 232                 1 / 973                   0 / 1000
BF-7114    Yes                  0 / 48                  ?                         ?
BF-7588    Yes                  193 / 1000              96 / 1000                 54 / 1000
BF-7888    Yes                  0 / 1000                ?                         ?
BF-8258    --                   0 / 636                 ?                         ?
BF-8642    Yes                  3 / 1000                ?                         0 / 1000
BF-9248    Yes                  0 / 1000                ?                         ?
BF-9426    --                   0 / 1000                ?                         ?
BF-9552    Yes                  5 / 563                 ?                         ?
BF-9864    --                   0 / 687                 ?                         ?
BF-10729   Yes                  2 / 1000                ?                         1 / 1000
BF-11054   Yes                  7 / 1000                ?                         3 / 1000