Friday 9 June 2017
Another Case Of Obscure CPU Nondeterminism
One of the "big bets" of rr is that commodity CPUs running user-space code really are deterministic in practice, or at least that the nondeterminism can be efficiently detected and controlled, without resorting to pervasive binary instrumentation. We're mostly winning that bet on x86 (but not ARM), but I'm uncomfortable knowing that otherwise-benign architectural changes could break us at any time. (One reason I'm keen to increase rr's user base is to make system designers care about not breaking us.)
I recently discovered the obscure XGETBV instruction. This is a problematic instruction for rr because it returns the contents of internal system registers that we can't guarantee will be unchanged from run to run. (It's a bit weird that the contents of these registers are allowed to leak to userspace, but anyway...) Fortunately it's barely used in practice; the only use I've seen of it so far is by the dynamic loader ld.so, to return the contents of the XINUSE register. This obscure register tracks, for each task, whether certain CPU components are known to be in their default states. For example, if your x86-64 process has never used any (legacy floating-point) x87 instructions, bit 0 of XINUSE should be clear. That should be OK for rr, since whether certain kinds of instructions have executed should be the same between recording and replay. Unfortunately I have observed cases where the has-used-x87 bit gets unexpectedly flipped on; by single-stepping the tracee it's pretty clear the bit is being set in code that has nothing to do with x87, and the point where it is set, or whether it is set at all, varies from run to run. (I have no way to tell whether the bit is being set by hardware or the kernel.) This is disturbing because it means that XGETBV is effectively a nondeterministic instruction. Fortunately, the ld.so users are just testing to see whether AVX has been used, so they mask off the x87 bit almost immediately, eliminating the nondeterminism from the program state before it can do any damage (well, unless you were super unlucky and took a context switch between XGETBV and the mask operation). I haven't seen any bits other than the x87 bit being unexpectedly set.
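To make the XGETBV-then-mask pattern concrete, here is a rough sketch in C. It is not ld.so's actual code. The function names (xgetbv1_supported, read_xinuse, avx_in_use) are made up for illustration, and it assumes GCC or Clang (for cpuid.h and inline assembly). XGETBV with ECX=1 returns XCR0 ANDed with XINUSE, and is only available when CPUID reports support for it, so the sketch checks that first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* State-component bits as they appear in XCR0 / XINUSE. */
#define XFEATURE_X87 (1ULL << 0)
#define XFEATURE_AVX (1ULL << 2)

/* Hypothetical helper: true if XGETBV with ECX=1 can be executed here. */
bool xgetbv1_supported(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* CPUID.1:ECX bit 27 (OSXSAVE): the OS has enabled XSAVE/XGETBV. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return false;
    /* CPUID.(EAX=0DH,ECX=1):EAX bit 2: XGETBV with ECX=1 is supported. */
    if (!__get_cpuid_count(0x0d, 1, &eax, &ebx, &ecx, &edx))
        return false;
    return (eax & (1u << 2)) != 0;
#else
    return false;
#endif
}

/* Hypothetical helper: returns XCR0 & XINUSE. Only call when supported. */
uint64_t read_xinuse(void) {
    assert(xgetbv1_supported());
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(1u));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;
#endif
}

/* Keep only the AVX bit, so a stray x87 bit cannot influence the result. */
bool avx_in_use(uint64_t xinuse) {
    return (xinuse & XFEATURE_AVX) != 0;
}
```

A caller would write something like `if (xgetbv1_supported() && avx_in_use(read_xinuse())) ...`. The raw value returned by read_xinuse is the nondeterministic part; once it has gone through avx_in_use, the x87 bit no longer matters.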
Fortunately it seems we can work around the problem completely in rr by setting the x87-in-use bit immediately after every exec. If the bit is always set, it doesn't matter if something intermittently sets it again. Another day, another bullet dodged!
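One way a ptrace-based tool could force the bit on is to edit the tracee's saved XSAVE image. The sketch below is an illustration under assumptions, not necessarily how rr does it. The function name mark_x87_in_use is hypothetical. It assumes the standard XSAVE layout (a 512-byte legacy region followed by a header whose first 8 bytes are XSTATE_BV), and that the caller fetches and writes back the image itself, for example with PTRACE_GETREGSET and PTRACE_SETREGSET using NT_X86_XSTATE on Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XSAVE_HEADER_OFFSET 512     /* header follows the legacy region */
#define XFEATURE_X87 (1ULL << 0)    /* bit 0: x87 state component */

/*
 * Hypothetical helper: set the x87 bit in the XSTATE_BV field of an XSAVE
 * image. Returns 0 on success, -1 if the buffer is too small to contain
 * the field. Calling it again on the same image changes nothing.
 */
int mark_x87_in_use(uint8_t *xsave_area, size_t size) {
    uint64_t xstate_bv;

    assert(xsave_area != NULL);
    if (size < XSAVE_HEADER_OFFSET + sizeof(xstate_bv))
        return -1;

    /* memcpy avoids unaligned access and aliasing problems. */
    memcpy(&xstate_bv, xsave_area + XSAVE_HEADER_OFFSET, sizeof(xstate_bv));
    xstate_bv |= XFEATURE_X87;
    memcpy(xsave_area + XSAVE_HEADER_OFFSET, &xstate_bv, sizeof(xstate_bv));
    return 0;
}
```

The point of the OR is that it is idempotent: whether or not something else has already flipped the bit, the image ends up in the same state, so recording and replay see the same value.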