Saturday 26 January 2019
Experimental Data On Reproducing Intermittent MongoDB Test Failures With rr Chaos Mode
Max Hirschhorn from MongoDB has released some very interesting results from an experiment reproducing intermittent MongoDB test failures using rr chaos mode.
He collected 18 intermittent test failure issues and tried running each of them 1000 times under the test harness, under rr both with and without chaos mode. He noted that for 13 of these failures, MongoDB developers had been able to make them reproducible on demand through manual study of the failure and trial-and-error insertion of "sleep" calls at relevant points in the code.
Unfortunately, rr didn't reproduce any of his 5 not-manually-reproducible failures. However, it did reproduce 9 of the 13 manually reproduced failures. Doing many test runs under rr chaos mode takes a lot less developer effort than the manual method, so it's probably a good idea to try running under rr first.
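To make "many test runs under rr chaos mode" concrete, here is a minimal Python sketch of such a loop. It is not Max's actual harness, which isn't described here. The test command `./run_flaky_test.sh` is a hypothetical placeholder, and the sketch assumes that a non-zero exit status seen by the caller means the test failed.

```python
import subprocess
import sys

# Command prefixes for each mode. `rr record --chaos` enables chaos mode;
# plain `rr record` records without it; "none" runs the test directly.
RR_PREFIX = {
    "chaos": ["rr", "record", "--chaos"],
    "rr": ["rr", "record"],
    "none": [],
}


def build_command(test_cmd, mode):
    """Return the full argv for running test_cmd in the given mode."""
    return RR_PREFIX[mode] + list(test_cmd)


def count_failures(test_cmd, runs, mode="chaos"):
    """Run the test `runs` times and count the runs that exit non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            build_command(test_cmd, mode),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Assumption: a non-zero exit status means the test failed.
        if result.returncode != 0:
            failures += 1
    return failures


# Hypothetical test command; substitute the real test invocation.
TEST_CMD = ["./run_flaky_test.sh"]
for mode in ("chaos", "rr", "none"):
    print(mode, build_command(TEST_CMD, mode))
```

Calling `count_failures(TEST_CMD, 1000, mode="chaos")` would then give a count comparable to the "rr-chaos reproductions" figures. Each `rr record` run saves a trace, so a failing run can be debugged afterwards with `rr replay` rather than having to be reproduced again.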
Of the 9 failures reproducible under rr, 3 also reproduced at least once in 1000 runs without rr (with frequencies 1, 3 and 54). Of course, with such low reproduction rates those failures would still be pretty hard to debug with a regular debugger or logging.
The data also shows that rr chaos mode is really effective: in almost all cases where he measured chaos mode vs rr non-chaos or running without rr, rr chaos mode dramatically increased the failure reproduction rate.
The data has some gaps but I think it's particularly valuable because it's been gathered on real-world test failures on an important real-world system, in an application domain where I think rr hasn't been used before. Max has no reason to favour rr, and I had no interaction with him between the start of the experiment and the end. As far as I know there's been no tweaking of rr and no cherry-picking of test cases.
I plan to look into the failures that rr was unable to reproduce to see if we can improve chaos mode to catch them and others like them in the future. Max hit at least one rr bug as well.
I've collated the data for easier analysis here:
Failure  | Reproduced manually | rr-chaos reproductions | regular rr reproductions | no-rr reproductions |
BF-9810  | --                  | 0/1000                 | ?                        | ?                   |
BF-9958  | Yes                 | 71/1000                | 2/1000                   | 0/1000              |
BF-10932 | Yes                 | 191/1000               | 0/1000                   | 0/1000              |
BF-10742 | Yes                 | 97/1000                | 0/1000                   | 0/1000              |
BF-6346  | Yes                 | 0/1000                 | 0/1000                   | 0/1000              |
BF-8424  | Yes                 | 1/232                  | 1/973                    | 0/1000              |
BF-7114  | Yes                 | 0/48                   | ?                        | ?                   |
BF-7588  | Yes                 | 193/1000               | 96/1000                  | 54/1000             |
BF-7888  | Yes                 | 0/1000                 | ?                        | ?                   |
BF-8258  | --                  | 0/636                  | ?                        | ?                   |
BF-8642  | Yes                 | 3/1000                 | ?                        | 0/1000              |
BF-9248  | Yes                 | 0/1000                 | ?                        | ?                   |
BF-9426  | --                  | 0/1000                 | ?                        | ?                   |
BF-9552  | Yes                 | 5/563                  | ?                        | ?                   |
BF-9864  | --                  | 0/687                  | ?                        | ?                   |
BF-10729 | Yes                 | 2/1000                 | ?                        | 1/1000              |
BF-11054 | Yes                 | 7/1000                 | ?                        | 3/1000              |
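
For anyone who wants to slice the numbers further, the rows above can be transcribed into a small Python script like the following sketch. The tuple layout and helper names are my own choices for illustration. A "?" in the table becomes `None`, and "--" in the "Reproduced manually" column is treated as `False`.

```python
# Each row: (failure, reproduced manually, rr-chaos, regular rr, no-rr).
# Each result is (reproductions, runs), or None where the table shows "?".
ROWS = [
    ("BF-9810", False, (0, 1000), None, None),
    ("BF-9958", True, (71, 1000), (2, 1000), (0, 1000)),
    ("BF-10932", True, (191, 1000), (0, 1000), (0, 1000)),
    ("BF-10742", True, (97, 1000), (0, 1000), (0, 1000)),
    ("BF-6346", True, (0, 1000), (0, 1000), (0, 1000)),
    ("BF-8424", True, (1, 232), (1, 973), (0, 1000)),
    ("BF-7114", True, (0, 48), None, None),
    ("BF-7588", True, (193, 1000), (96, 1000), (54, 1000)),
    ("BF-7888", True, (0, 1000), None, None),
    ("BF-8258", False, (0, 636), None, None),
    ("BF-8642", True, (3, 1000), None, (0, 1000)),
    ("BF-9248", True, (0, 1000), None, None),
    ("BF-9426", False, (0, 1000), None, None),
    ("BF-9552", True, (5, 563), None, None),
    ("BF-9864", False, (0, 687), None, None),
    ("BF-10729", True, (2, 1000), None, (1, 1000)),
    ("BF-11054", True, (7, 1000), None, (3, 1000)),
]


def rate(result):
    """Reproduction rate as a fraction of runs, or None if not measured."""
    if result is None:
        return None
    hits, runs = result
    return hits / runs


def fmt(result):
    """Format a rate as a percentage, or "?" if not measured."""
    r = rate(result)
    return "?" if r is None else f"{100 * r:.1f}%"


# Failures that were reproduced manually, and the subset rr chaos mode caught.
manual = [row for row in ROWS if row[1]]
chaos_hits = [row for row in manual if row[2][0] > 0]
# Of those, the ones that also reproduced at least once without rr.
no_rr_hits = [row for row in chaos_hits if row[4] is not None and row[4][0] > 0]

print("manually reproduced:", len(manual))
print("of those, reproduced under rr chaos mode:", len(chaos_hits))
print("of those, also reproduced without rr:", len(no_rr_hits))

# Per-failure rates for every failure that chaos mode reproduced.
for name, _, chaos, regular, no_rr in chaos_hits:
    print(f"{name}: chaos {fmt(chaos)}, regular rr {fmt(regular)}, no rr {fmt(no_rr)}")
```

Rates are computed against the number of runs actually completed for each cell, since not every configuration reached 1000 runs.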