Wednesday 29 May 2019
This upcoming PLDI paper is cool. One thing I like about it is that it does a detailed comparison against rr, and a fair comparison too. The problem of reproducing race bugs using randomized scheduling in a record-and-replay setting is important, and the paper has interesting quantitative results.
It's unfortunate that the paper doesn't mention rr's chaos mode, which is our attempt to tackle roughly the same problem. It would be very interesting to compare chaos mode to the approach in this paper on the same or similar benchmarks.
I'm quite surprised that the PLDI reviewers accepted this paper. I don't mean that the paper is poor; I think it's actually quite good. We submitted versions of our rr paper to several conferences, including PLDI, before USENIX ATC accepted it, and we consistently got strongly negative review comments that it wasn't clear enough which programs rr would record and replay successfully, and what properties of the execution were guaranteed to be preserved during replay. We described many steps we had to take to get applications to record efficiently in rr in practice, and many reviewers seemed to perceive rr as just a collection of hacks and thus not publishable. Yet it seems to me this "sparse replay" approach is considerably more vague than rr about what it can handle and what gets preserved during replay. I do not see any principled reason why the critical reviewers of our rr paper would not have criticised this paper even more harshly. I wonder what led to a different outcome.
Perhaps making the idea of "sparse replay" (i.e., record only some subset of behaviour that's necessary and sufficient for a particular application) a focus of the paper effectively lampshaded the problem, or just sufficiently reduced expectations by not claiming to be a general-purpose tool.
I also suspect it's partly just "luck of the draw" in reviewer assignment. It is an unfortunate fact that paper review outcomes can be pretty random. As both a submitter and reviewer, I've seen that scores from different reviewers often differ wildly: it's not uncommon for a paper to get both A and D reviews on an A-to-D scale. When a paper gets both A and D, it typically gets a lot more scrutiny from the review committee to reach a decision, but one should also expect that there are many (un)lucky papers that just happen to avoid a D reviewer or fail to connect with an A reviewer. Given how important publications are to many people (fortunately, not to me), it's not a great system. Though, like democracy, maybe it's better than the others.