Saturday 12 May 2018
rr's chaos mode introduces nondeterminism while recording application execution, to try to make intermittent bugs more reproducible. I'm always interested in hearing about bugs that cannot be reproduced under chaos mode, especially if those bugs have been diagnosed. If we can figure out why a bug was not reproducible under chaos mode, we can often extend chaos mode to make it reproducible, and this improves chaos mode for everyone. If you encounter such a bug, please file an rr issue about it.
I just landed one such improvement. To trigger a specific Spidermonkey JS engine bug, some thread X had to do a FUTEX_WAKE to wake up thread Y, then immediately yield to let thread Y run for a while without X running any further. rr chaos mode assigns random priorities to threads and strictly adheres to them, so in some runs it would assign X a low priority and Y a high priority and schedule Y whenever both were runnable. However, rr's syscall buffering optimization means the rr supervisor process is not notified after the FUTEX_WAKE and has no opportunity to interrupt X and schedule Y instead, so we keep running the lower-priority X thread, violating our scheduling policy. (Chaos mode randomizes scheduling intervals so it was possible for X to yield at the right point, but very unlikely because the "window of vulnerability" is very small.) The fix is quite easy: in chaos mode, FUTEX_WAKE should not use the syscall buffering optimization. This adds some overhead, but hopefully not all that much, because every FUTEX_WAKE is normally paired with a FUTEX_WAIT (futex-using code should not issue a FUTEX_WAKE if there are no waiters), and a FUTEX_WAIT yields, which is already an expensive operation.
The same sorts of issues exist for other system calls that can make another higher-priority thread runnable, and I've added some slightly more elaborate fixes for those.
One day I should do a proper evaluation of these techniques and publish them...