I investigated an rr bug report and discovered an annoying Intel CPU bug that affects rr replay using data watchpoints. It doesn't seem to be hit very often in practice, which is good because I don't know any way to work around it. It turns out that the bug is probably covered by an existing Intel erratum for Skylake and Kaby Lake (and probably later generations, but I'm not sure), which I even blogged about previously! However, the erratum does not mention watchpoints and the bug I've found definitely depends on data watchpoints being set.
I was able to write a stand-alone testcase to characterize the bug. The issue seems to be that if a rep stos (and probably rep movs) instruction writes between 1 and 64 bytes (inclusive), and you have a read or write watchpoint in the range [64, 128) bytes from the start of the writes (i.e., not triggered by the instruction), then one spurious retired conditional branch is (usually) counted. The alignment of the writes does not matter, and it's not related to speculative execution.
If you find rr failing during replay with watchpoints set, and the failures go away if you remove the watchpoints, it could well be this bug. Broadwell and earlier don't seem to have the bug.
A possible workaround would be to disable "fast-string optimization" in the kernel at boot time. I don't think there's any way for users to force this to happen in Linux, currently, but someone could write a kernel patch adding a command-line option for that and send it upstream. It would be great if they did!
Fortunately this sort of bug does not affect Pernosco.